# Project_CodeNet
**Repository Path**: helwen/Project_CodeNet
## Basic Information
- **Project Name**: Project_CodeNet
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-06-18
- **Last Updated**: 2021-06-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Project CodeNet
[DOI](https://zenodo.org/badge/latestdoi/363800912)
The goal of Project CodeNet is to provide the *AI-for-Code* research community with a large scale, diverse, and high quality curated dataset to drive innovation in AI techniques.
## Table of Contents
* [Introduction](#introduction)
* [Differentiation](#differentiation)
* [Benchmarks](#benchmarks)
* [Potential use cases](#potential-use-cases)
* [Usability](#usability)
* [Models and experiments](#models-and-experiments)
* [Relevant links](#relevant-links)
* [Download the dataset](#download-the-dataset)
* [Dataset overview](#dataset-overview)
* [Dataset statistics](#dataset-statistics)
* [Data](#data)
* [Metadata](#metadata)
* [Metadata at the dataset level](#metadata-at-the-dataset-level)
* [Metadata at the problem level](#metadata-at-the-problem-level)
* [Directory structure and naming convention](#directory-structure-and-naming-convention)
* [Relationships among the metadata and data](#relationships-among-the-metadata-and-data)
* [Example of getting the source file for a particular submission](#example-of-getting-the-source-file-for-a-particular-submission)
* [Example of getting the metadata for a particular source file](#example-of-getting-the-metadata-for-a-particular-source-file)
* [Tools to process source files](#tools-to-process-source-files)
* [Statistics](#statistics)
* [Access and selection](#access-and-selection)
* [Pre-processing](#pre-processing)
* [Contributors](#contributors)
## Introduction
A decade ago, Marc Andreessen [famously wrote](https://a16z.com/2011/08/20/why-software-is-eating-the-world/) that "software is eating the world." Software now permeates every part of our existence; Google services combine for [2 billion lines of code](https://www.wired.com/2015/09/google-2-billion-lines-codeand-one-place/), and a modern vehicle [contains around](https://www.technologyreview.com/2012/12/03/181350/many-cars-have-a-hundred-million-lines-of-code/) 100 million lines of code. It's a monumental challenge to create, debug, maintain, and update these complex software systems. Recently, a fast-growing discipline known as AI for Code has emerged, aiming to help software developers improve their productivity by automating the software engineering process. AI for Code researchers have been leveraging technologies like NLP and augmenting them with code analysis and compilation techniques to perform a myriad of practical tasks, such as code search, summarization, and completion, as well as code-to-code translation. The discipline isn't limited to academic research either: Ruchir Puri, IBM Research's chief research scientist, discussed in a recent [podcast](https://open.spotify.com/episode/7gHPbVBHEgSdrACTow7Gql) how technologies from AI for Code are being used to modernize legacy software by helping migrate monolithic applications to microservices for IBM's enterprise clients.
AI for Code is poised to transition from proof-of-concept to widespread adoption. To provide a catalyst for such a tipping point, researchers at IBM Research have introduced Project CodeNet, a large-scale dataset for benchmarking and experimentation. Project CodeNet has many characteristics (large scale, diversity, etc.) in common with ImageNet, the huge image dataset that had a dramatic impact on the field of computer vision research. Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. Project CodeNet aims to do for AI for Code what ImageNet did for computer vision.
### Differentiation
There are a few differentiating features of Project CodeNet when compared to other similar efforts. In addition to the size of the dataset, the code samples are written in over 50 programming languages, though the dominant languages are C++, C, Python, and Java. The code samples in Project CodeNet are annotated with a rich set of information, such as the code size, memory footprint, CPU run time, and status, which indicates acceptance or error types. Over 90% of the problems come with their respective problem description, which contains a concise problem statement and a specification of the input and output formats. When available, we also extracted sample input and output from the problem descriptions and provide them as part of the dataset. Users can execute the accepted code samples (over 50% of the submissions are accepted) to extract additional metadata and verify outputs from generative AI models for correctness.
Another area that Project CodeNet addresses is the quality of the data samples. From a [paper](https://arxiv.org/pdf/1812.06469.pdf) by Allamanis, we learned that quite a large number of frequently used AI for Code datasets have duplicate or near-duplicate code samples, which can inflate performance metrics by as much as 100%. In addition, we found that problem-submission style datasets from online judging systems can contain clusters of identical problems, which will certainly skew the performance metrics. One example is [POJ-104](https://sites.google.com/site/treebasedcnn/), in which problems 26 and 62 are identical. We therefore identified the near-duplicates and the identical problem clusters in Project CodeNet and provide this information for the benefit of the users.
### Benchmarks
In light of these issues, we have extracted several benchmark datasets from CodeNet for users to perform code classification and code similarity experiments. They have been filtered to remove identical problem clusters and near-duplicate code samples, so that performance metrics can be measured on training and test data samples with the appropriate statistics. There are two C++ benchmark datasets that are similar to the popular POJ-104 but approximately ten times its size. We felt that the size increase is necessary, since [98% accuracy](https://github.com/zhangj111/astnn) has already been achieved in code classification on POJ-104. An order-of-magnitude larger dataset will leave ample room to advance the state of the art with more complex neural networks and algorithms. The other two benchmark datasets are in Python and Java, which provide a different flavor because of the frequent use of library functions.
### Potential use cases
The rich metadata and diversity open Project CodeNet to a plethora of use cases. The problem-submission relationship in Project CodeNet corresponds to type-4 similarity and can be used for code search and clone detection. The code samples in Project CodeNet are labeled with their acceptance status, and we can explore AI techniques to distinguish correct code from problematic code. Project CodeNet's metadata also enables the tracking of how a submission evolves from problematic to accepted, which could be used for exploring automatic code correction. Each code sample is labeled with CPU run time and memory footprint, which can be used for regression studies and prediction. Given its wealth of programs written in a multitude of languages, Project CodeNet may serve as a valuable benchmark dataset for source-to-source translation.
### Usability
To facilitate the creation of customized benchmarks and datasets, we provide a set of productivity tools to aggregate code samples based on user criteria. We are also releasing pre-processing tools to transform code samples into [token sequences](tools/tokenizer), [simplified parse trees](tools/spt-generator) and other [code graphs](tools/analysis-graph-generator).
## Models and experiments
We have performed numerous experiments on the CodeNet dataset. The goal of these experiments is to produce a set of baseline models and results against which users of the CodeNet dataset can gauge their research. The run scripts and training scripts are available in the model-experiments directory. The classification and similarity experiments use the benchmark datasets we extracted from CodeNet as training and test datasets. In addition to experiments based on token sequences, we also have experiments leveraging graph neural networks (GNNs). For the convenience of users interested in GNNs, we have included the simplified parse tree (SPT) representation of the code samples for each benchmark dataset. The experiment on the Masked Language Model has a companion Jupyter notebook in the notebooks directory.
## Relevant links
- [Project CodeNet full dataset: Project_CodeNet.tar.gz](https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet.tar.gz)
- [Project CodeNet metadata: Project_CodeNet_metadata.tar.gz](https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet_metadata.tar.gz)
- [Project CodeNet paper: ProjectCodeNet_NeurIPS2021.pdf](./ProjectCodeNet_NeurIPS2021.pdf)
## Download the dataset
Download the full dataset from our [data repository](https://developer.ibm.com/technologies/artificial-intelligence/data/project-codenet/). Use
`tar -zxf Project_CodeNet_full.tar.gz`
to uncompress and untar it. The directory structure and how the code samples are organized are explained [here](README.md#directory-structure-and-naming-convention).
The 4 benchmark datasets, Project_CodeNet_C++1000, Project_CodeNet_C++1400,
Project_CodeNet_Python800, and Project_CodeNet_Java250 are included in the
full dataset and are available separately in the "Archive Dataset File" column of the table in the "Get this Dataset"
section in our [data repository](https://developer.ibm.com/technologies/artificial-intelligence/data/project-codenet/).
They can be used for code classification and code similarity research as a replacement for, or in addition to, the dataset [POJ-104](https://sites.google.com/site/treebasedcnn/).
To expedite AI for code research using graph neural networks, we have included the simplified parse tree (SPT) representation of the code samples for each benchmark dataset. They are available in the "Archive SPT File" column of the table in the "Get this Dataset" section in our [data repository](https://developer.ibm.com/technologies/artificial-intelligence/data/project-codenet/).
## Dataset overview
The Project CodeNet Dataset consists of a very large collection of source files, extensive metadata, tooling to access the dataset and make tailored selections, and documentation.
The basis of the dataset is the data available on two online judge web sites:
1. [AIZU Online Judge](https://onlinejudge.u-aizu.ac.jp/home)
2. [AtCoder](https://atcoder.jp/)
An online judge website offers programmers an opportunity to test their skills by posing programming problems in the form of courses or contests. Users may submit their solutions, which are then judged by an automatic review mechanism. The outcome is reported back to the user. Problem descriptions, user submissions, and associated metadata are available for study via various REST APIs.
The first step in constructing Project CodeNet is downloading the problem descriptions and the source code submissions from the websites mentioned above, followed by reshaping and consolidating the metadata and cleaning up the inconsistencies, omissions, and
mistakes in the source data itself.
### Dataset statistics
The dataset comprises 13,916,868 submissions, divided among 4053 problems (of which 5 are empty). Of the submissions, 53.6% (7,460,588) are *accepted*, 29.5% are marked as *wrong answer*, and the remainder suffer from one of the other possible rejection causes. The data contains submissions in 55 different languages, although 95% of them are coded in the six most common languages (C++, Python, Java, C, Ruby, C#). C++ is the most common language, with 8,008,527 submissions (57% of the total), of which 4,353,049 are *accepted*. (The original README includes two pie charts depicting the language and status distributions of the submissions.)
A detailed overview of the dataset statistics can be found in this [spreadsheet](assets/Project_CodeNet_statistics.xlsx).
## Data
The data consist of complete programs in a particular programming language. Each program is contained in a single file. The file has a name with an extension that denotes the programming language used. (More details about the specific programming language and the version of the compiler/interpreter used can be found in the metadata.)
Each program attempts to solve a certain programming task or problem. There are many problems and each problem might have many solutions in different languages. We refer to each program as a submission instead of a solution since it might not be complete and correct. Solutions are the accepted submissions that are compilable and executable, and at least correctly produce the expected results on all provided test cases. (Of course, according to the late Dijkstra, tests are no proof of correctness.)
## Metadata
The metadata provides properties of interest about the problems and their submissions. Foremost it formalizes the organization of the data and the relationship between problems, languages, and the source code files. The metadata allows for queries about the data and to make specific selections among the large collection of problems, languages, and source files.
Metadata is made available in comma-separated value (CSV) files. This allows for easy processing, even with simple command-line tools. Some of the fields in the CSV files might be empty, and for submissions that are not accepted, some fields might have invalid entries such as negative numbers for CPU time. Extra checking is needed when parsing these files.
The metadata is hierarchically organized on 2 levels: the first level is the dataset level that relates to all the different problems defined by the various dataset sources. The second level is the problem level that relates to all source code submissions pertaining to a single problem or task.
Metadata and data are deliberately kept fully separated within the file system.
### Metadata at the dataset level
At the dataset level there is a single CSV file (`problem_list.csv`) listing all the different problems. Additionally, for each problem there is a more extensive description that sets the problem and any further requirements and constraints and often provides examples of data input and expected output.
The fields and their format of this CSV file are captured by the following table:
name of column | data type | unit | description
-- | -- | -- | --
id | string | none | unique anonymized id of the problem
name | string | none | short name of the problem
dataset | string | none | original dataset, AIZU or AtCoder
time_limit | int | millisecond | maximum time allowed for a submission
memory_limit | int | KB | maximum memory allowed for a submission
rating | int | none | rating, i.e., difficulty of the problem
tags | string | none | list of tags separated by "\|"; not used
complexity | string | none | degree of difficulty of the problem; not used
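For illustration, here is a minimal Python sketch (not part of the official tooling) that reads `problem_list.csv` using the column names from the table above. It assumes the file carries a header row and that a local copy of the dataset sits under `Project_CodeNet/`.

```python
import csv
from pathlib import Path

# Path to a local copy of the dataset-level metadata (assumption).
PROBLEM_LIST = Path("Project_CodeNet/metadata/problem_list.csv")

def to_int(field, default=None):
    """Convert a CSV field to int, tolerating empty fields."""
    return int(field) if field not in ("", None) else default

with PROBLEM_LIST.open(newline="") as f:
    problems = list(csv.DictReader(f))

# List the AIZU problems with the tightest time limits.
aizu = [p for p in problems if p["dataset"] == "AIZU"]
aizu.sort(key=lambda p: to_int(p["time_limit"], 10**9))

for p in aizu[:5]:
    print(p["id"], p["name"], to_int(p["time_limit"]), "ms")
```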
### Metadata at the problem level
At the problem level there is one CSV file per problem; all of these files share the same header.
The fields and their format of this CSV file are captured by the following table:
name of column | data type | unit | description
-- | -- | -- | --
submission_id | string | none | unique anonymized id of the submission
problem_id | string | none | anonymized id of the problem
user_id | string | none | anonymized user id of the submission
date | int | seconds | date and time of submission in the Unix timestamp format (seconds since the epoch)
language | string | none | mapped language of the submission (ex: C++14 -> C++)
original_language | string | none | original language specification
filename_ext | string | none | extension of the filename that indicates the programming language used
status | string | none | acceptance status, or error type
cpu_time | int | millisecond | execution time
memory | int | KB | memory used
code_size | int | bytes | size of the submission source code in bytes
accuracy | string | none | number of tests passed (AIZU only)
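As noted above, fields can be empty or invalid for non-accepted submissions, so parsing should be defensive. The following minimal Python sketch (assuming a local copy of the dataset; the problem id is only an example) collects the accepted C++ submissions for one problem and reports their median CPU time.

```python
import csv
import statistics
from pathlib import Path

# Per-problem metadata file (example problem id, assumed local path).
META = Path("Project_CodeNet/metadata/p00001.csv")

def safe_int(field):
    """Return the field as int, or None if it is empty or malformed."""
    try:
        return int(field)
    except (TypeError, ValueError):
        return None

with META.open(newline="") as f:
    rows = list(csv.DictReader(f))

accepted_cpp = [r for r in rows
                if r["status"] == "Accepted" and r["language"] == "C++"]

# Negative or missing cpu_time values are filtered out defensively.
cpu_times = [t for r in accepted_cpp
             if (t := safe_int(r["cpu_time"])) is not None and t >= 0]

print(f"{len(accepted_cpp)} accepted C++ submissions")
if cpu_times:
    print("median CPU time:", statistics.median(cpu_times), "ms")
```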
Here is a table of all the possible status values. The “abbreviation” and “numeric code” are sometimes seen in the original metadata on the websites; they are listed here for reference and completeness. These fields do not occur in the Project CodeNet metadata.
status | abbreviation | numeric code
-- | -- | --
Compile Error | CE | 0
Wrong Answer | WA | 1
Time Limit Exceeded | TLE | 2
Memory Limit Exceeded | MLE | 3
Accepted | AC | 4
Judge Not Available | JNA | 5
Output Limit Exceeded | OLE | 6
Runtime Error | RE | 7
WA: Presentation Error | PE | 8
Waiting for Judging | WJ |
Waiting for Re-judging | WR |
Internal Error | IE |
Judge System Error | |
## Directory structure and naming convention
The data and metadata are organized in a rigorous directory structure. At the top level sits the `Project CodeNet` directory with the sub-directories `data`, `derived`, `metadata`, and `problem_descriptions`:
- `data` is further subdivided into a directory per problem and within each problem directory, directories for each language. The language directory contains all the source files supposed to be written in that particular programming or scripting language. When there are no submissions for a particular language, there will be no directory for it, but the problem directory will always be there, even if there are no submissions at all.
The name of the directory for a programming language is the common name for the language using proper capitalization and special characters. This name is the consolidation of the names used in the metadata. Information is available about how the original language designations are mapped into the directory names and how these more general and common names are mapped to the submission file name extensions. As an example, a source could be designated c++14, which is mapped into the directory `C++` (notice the capital C) and will get the extension `.cpp`.
- `derived` holds information about near-duplicates, identical problem clusters, sample input and output for each problem, as well as the benchmarks.
- `metadata` holds all the problem CSV files and the `problem_list.csv` file.
- `problem_descriptions` holds HTML files for most problems, giving an extensive description of the problem, often accompanied with some sample input and expected output.
For the sake of creating a uniform set of metadata across all data sources, and to hide any sensitive information, some metadata fields are anonymized by randomly (but uniquely and consistently) renumbering problem, submission, and user identifiers (ids). The identifiers we use are defined by simple regular expressions:
- problem ids are anonymized and follow this pattern: `p[0-9]{5}` (a `p` followed by exactly 5 digits).
- submission ids are anonymized and follow this pattern: `s[0-9]{9}` (an `s` followed by exactly 9 digits).
- user ids are anonymized and follow this pattern: `u[0-9]{9}` (a `u` followed by exactly 9 digits).
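These patterns can be checked programmatically. A small Python sketch, using identifiers of the form shown in the examples that follow:

```python
import re

# Anonymized identifier patterns, exactly as defined above.
PROBLEM_RE = re.compile(r"^p[0-9]{5}$")
SUBMISSION_RE = re.compile(r"^s[0-9]{9}$")
USER_RE = re.compile(r"^u[0-9]{9}$")

assert PROBLEM_RE.match("p00001")
assert SUBMISSION_RE.match("s300682070")
assert USER_RE.match("u558442027")
assert not PROBLEM_RE.match("p1")   # too few digits
```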
## Relationships among the metadata and data
The main relationship between problem metadata and data is the fact that each metadata record (a non-header row in a problem CSV file) describes one source file and provides all information about its location. The directory structure and naming convention as stated above are implicitly assumed.
### Example of getting the source file for a particular submission
Starting at a CSV metadata entry for a particular submission, here is how to get to the corresponding source file. Say that the submission id is `s300682070`. Either we know this is a submission to problem `p00001` upfront or we can grep through all `Project_CodeNet/metadata/p?????.csv` files to learn that. We get a brief description of this problem by looking at the `p00001` entry in the `Project_CodeNet/metadata/problem_list.csv`:
```console
p00001,List of Top 3 Hills,AIZU,1000,131072,,,
```
We can get a more verbose description of this problem by reading `Project_CodeNet/problem_descriptions/p00001.html`.
The `Project_CodeNet/metadata/p00001.csv` file provides the info on all submissions. For our selected submission we find:
```console
s300682070,p00001,u558442027,1480319506,JavaScript,JavaScript,js,Accepted,60,15496,219,4/4
```
We see it is an `Accepted` submission in the language `JavaScript` with file extension `.js`.
The source file path therefore is: `Project_CodeNet/data/p00001/JavaScript/s300682070.js`
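The path construction can be expressed in a few lines of Python; the following sketch simply applies the naming convention described above to the fields of the metadata row (the dataset root path is an assumption).

```python
from pathlib import Path

# Root of a local copy of the dataset (assumption).
DATASET_ROOT = Path("Project_CodeNet")

def source_path(problem_id: str, language: str,
                submission_id: str, ext: str) -> Path:
    """Build data/<problem>/<language>/<submission>.<ext> under the root."""
    return DATASET_ROOT / "data" / problem_id / language / f"{submission_id}.{ext}"

# Fields taken from the metadata entry above.
print(source_path("p00001", "JavaScript", "s300682070", "js"))
# -> Project_CodeNet/data/p00001/JavaScript/s300682070.js
```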
### Example of getting the metadata for a particular source file
Likewise, we can play the reverse game of finding the metadata entry for a given submission source file. Say the source file is `Project_CodeNet/data/p00001/JavaScript/s300682070.js`.
Encoded in this file name path we see the problem id `p00001` and language `JavaScript` and of course the submission id `s300682070`. We find the metadata CSV file to be: `Project_CodeNet/metadata/p00001.csv`. Opening that file and searching for the submission id we find the entry:
```console
s300682070,p00001,u558442027,1480319506,JavaScript,JavaScript,js,Accepted,60,15496,219,4/4
```
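The reverse lookup can likewise be scripted. The sketch below parses the problem id, language, and submission id out of the path and scans the problem's metadata CSV for the matching row; it assumes the standard directory layout described earlier.

```python
import csv
from pathlib import Path

def metadata_for(source_file: str) -> dict:
    """Return the metadata row for a source file, given the standard layout."""
    path = Path(source_file)
    submission_id = path.stem                 # e.g. s300682070
    problem_id = path.parent.parent.name      # e.g. p00001
    root = path.parent.parent.parent.parent   # dataset root, e.g. Project_CodeNet
    with (root / "metadata" / f"{problem_id}.csv").open(newline="") as f:
        for row in csv.DictReader(f):
            if row["submission_id"] == submission_id:
                return row
    raise KeyError(submission_id)

print(metadata_for("Project_CodeNet/data/p00001/JavaScript/s300682070.js"))
```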
## Tools to process source files
The source files of Project CodeNet represent examples of some 50+ different programming and scripting languages. Of course not all languages are equally represented: most submissions are written in the more popular languages C, C++, Java, and Python.
To complement our large dataset of source code, a suite of tools and utilities is provided. These tools target several purposes:
- derive statistics from the dataset
- access the dataset files to make selections
- preprocess the source files to extract certain information
- facilitate conversions between popular formats
### Statistics
Since Project CodeNet uses the file system as storage and uses a rigorous directory structure, many (Linux) command-line utilities can be directly used to extract interesting statistics about the dataset. Utilities like `ls`, `wc` and `grep` are very useful. The CSV metadata can best be browsed using [`csvkit`](https://csvkit.readthedocs.io/en/latest/) components like `csvstat`.
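As a Python alternative to these command-line utilities (a sketch, not one of the provided tools), the metadata can also be tallied directly with the standard library, assuming a local copy of the dataset:

```python
import csv
from collections import Counter
from pathlib import Path

# Location of the per-problem metadata CSVs (assumption).
METADATA_DIR = Path("Project_CodeNet/metadata")

languages, statuses = Counter(), Counter()
# p?????.csv matches the per-problem files but not problem_list.csv.
for csv_file in METADATA_DIR.glob("p?????.csv"):
    with csv_file.open(newline="") as f:
        for row in csv.DictReader(f):
            languages[row["language"]] += 1
            statuses[row["status"]] += 1

print(languages.most_common(10))
print(statuses.most_common())
```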
More elaborate statistics about the dataset can easily be retrieved using SQL queries on a database representation of the metadata. [HSQLDB](http://hsqldb.org/) is a database that runs off a CSV file. Our CSV problem metadata files are simply stripped of their headers and concatenated. A suite of useful SQL queries is available. A separate [document](doc/HSQLDB.md) explains the necessary steps.
### Access and selection
As described above, it should be easy to create specific subsets of the
dataset merely by copying (or symlinking) relevant files and/or
directories. For more elaborate selections based on a subset or range of
problems, a subset of languages, statuses, and code sizes, several Bash
scripts are available to accomplish that. These scripts reside in the
`tools/aggregation-scripts` directory and are separately documented in this [README](tools/aggregation-scripts/README.md).
### Pre-processing
We provide tools to convert code samples into representations that can be consumed by AI algorithms (a small tokenization sketch follows the list below):
- generation of a stream of tokens: [tokenizer](tools/tokenizer)
- parsing to a tree/abstract syntax tree: [AST generation](tools/spt-generator)
- control and data flow graph construction: [code analysis](tools/analysis-graph-generator)
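As a minimal illustration of the token-stream idea, the sketch below uses Python's standard `tokenize` module on a small Python submission; the repository's own [tokenizer](tools/tokenizer) supports multiple languages and uses a different output format.

```python
import io
import tokenize

# A tiny Python "submission" used purely for illustration.
source = '''\
n = int(input())
print(n * (n + 1) // 2)
'''

# Emit (token kind, token text) pairs, skipping layout-only tokens.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type not in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER):
        print(tokenize.tok_name[tok.type], repr(tok.string))
```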
Whether and to what extent the above steps can successfully be applied to any given source file depends on several factors. Obviously, if the submission does not have `Accepted` status, it is to be expected that even simple tokenization may fail because of malformed lexical elements. But the situation for `Accepted` submissions is not always better: programmers might have used certain non-standard features of the language that happen to be accepted by a particular compiler or interpreter. A simple example is the use of a dollar sign in a C identifier. For languages like C and C++ that use a pre-processor, the use of macros and conditional defines can hugely change what the code ultimately looks like.
## Contributors
Ruchir Puri, David Kung, Geert Janssen, Giacomo Domeniconi, Vladimir Zolotov, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler.
IBM Research recently released a dataset named CodeNet, containing 14 million code samples for training machine learning models on programming tasks. Its main characteristics include:
- the largest coding dataset to date, with 4000 problems, 14 million code samples, and 50+ programming languages;
- rich annotations, including problem descriptions, memory/time limits, languages, and pass/error status.

IBM hopes that CodeNet will follow in the footsteps of ImageNet, the large image dataset, and become the leading dataset for teaching software to understand software development. IBM hopes CodeNet can be used to train development tools that:
- translate code from one programming language to another;
- recommend and complete code;
- optimize code;
- search application and library sources to find required routines;
- identify correct and incorrect implementations.
**Automating programming with deep learning**

In recent years, machine learning has made remarkable progress, and AI has automated many kinds of work, including programming. Yet AI's penetration into software development has run into great difficulty.

When programming, people typically rely on a host of conscious and subconscious thinking mechanisms to discover new problems and explore different solutions. By contrast, most machine learning algorithms require well-defined problems and large amounts of annotated data before they can produce models that solve the same programming problems.

To address this challenge, researchers and developers have made many efforts, including creating datasets and benchmarks for developing and evaluating "AI for programming" systems. However, given the creative and open-ended nature of software development, it is hard to build a perfect dataset for programming.
IBM's researchers set out to create a multi-purpose dataset that can be used to train machine learning models for a variety of tasks. CodeNet's creators describe it as a "very large scale, diverse, and high-quality dataset that can accelerate the pace of AI for code." The dataset contains 14 million code samples totaling 500 million lines of code written in 55 programming languages; C++ is the most used language in the samples, with Python in second place. The code samples were obtained from submissions to nearly 4,000 challenges posted on the online programming platforms AIZU and AtCoder, and they include both correct and incorrect answers to those challenges.
CodeNet project repository: https://github.com/IBM/Project_CodeNet

One of CodeNet's main features is the annotation added to the code samples. Each programming challenge included in the dataset has a text description along with CPU time and memory limits. Each code submission carries a dozen or so pieces of information, including the language, submission date, memory footprint, execution time, and acceptance or error type. IBM's researchers went to great lengths to keep the dataset balanced along several dimensions, such as programming language and acceptance or error type.
**Machine learning on programming tasks**

CodeNet is not the only dataset for training machine learning models on programming tasks. Compared with other datasets, CodeNet stands out first in scale, both in the number of samples and in the diversity of languages, but more importantly in the metadata that accompanies the code samples. The rich annotations added to CodeNet make it suitable for a variety of tasks rather than for one specific programming task.

Machine learning models for programming tasks can be developed with CodeNet in the following ways:
- Language translation. Because each programming challenge in the dataset contains submissions in different programming languages, data scientists can use it to create machine learning models that translate code from one language to another. This can be handy for those who want to port legacy code to a new language so that a new generation of programmers can access it and maintain it with modern development tools.
- Code recommendation. Recommendation tools can be as simple as autocomplete-style models that finish the current line of code, or more complex systems that write complete functions or blocks of code.
- Code optimization. Because CodeNet has extensive metadata on memory and execution-time metrics, data scientists can also use it to develop code optimization systems. Alternatively, the error-type metadata can be used to train machine learning systems that flag potential flaws in source code.
- Code generation, a more advanced use case. CodeNet is a rich library of textual problem descriptions paired with corresponding source code. Developers have already used large language models such as GPT-3 to generate code from natural-language descriptions; CodeNet may help fine-tune such language models to be more consistent at code generation.

IBM's researchers have already run several experiments on CodeNet, including code classification, code similarity evaluation, and code completion. The deep learning architectures used include simple multilayer perceptrons, convolutional neural networks, graph neural networks, and Transformers.

The dataset was developed jointly by IBM and the MIT-IBM Watson AI Lab; the experimental results reported in the study show accuracy above 90% on most tasks.
Paper: https://github.com/IBM/Project_CodeNet/blob/main/ProjectCodeNet.pdf
**Building effective machine learning systems takes enormous effort**

IBM's engineers did a great deal of work to curate the CodeNet dataset and develop its companion tools.

First, the team had to collect code samples from AIZU and AtCoder. Only one of the two platforms offers an application programming interface (API) that makes the code easy to retrieve; the other has no easily accessible interface, so the team had to develop new tools to scrape the data from the platform's web pages and convert it into a tabular format. The researchers then had to manually merge the two datasets into a unified schema.

Next, the team had to develop tools to identify and remove duplicate code and samples containing large amounts of dead code (source code that is never executed at run time), in order to clean out useless data.

In addition, the team developed pre-processing tools that make it easier to train machine learning models on the CodeNet corpus, including tokenizers for different programming languages, parse trees, and graph-representation generators for graph neural networks.

All of this is a reminder that building effective machine learning systems requires enormous effort. Artificial intelligence still has a long way to go before it can replace programmers.
References:
https://bdtechtalks.com/2021/05/17/ibms-codenet-machine-learning-programming/
https://news.51cto.com/art/202105/662376.htm