# Large Language Models (LLMs) for Code

A collection of LLMs for code. **Continuously updated. :running:**

If you spot any **errors** or **missing models**, please open a [new issue](https://github.com/wanghanbinpanda/Large-Language-Models-for-Code/issues) or contact us by e-mail: wanghanbin95@163.com.

## Contents

- [Code LLMs](#Code-LLMs)
- [Timeline of Code LLMs](#Timeline-of-Code-LLMs)
- [Parameters of Code LLMs](#Parameters-of-Code-LLMs)
- [Models](#Models)
- [Improve Code LLMs](#Improve-Code-LLMs)
- [Dataset](#Dataset)
- [Benchmark](#Benchmark)
- [Future](#Future)

## Code LLMs

The Code LLMs listed here are large language models trained specifically for code-related tasks; their training corpora may contain natural language as well as code. General-purpose LLMs are not included, even though they are also capable of code-related tasks.

## Timeline of Code LLMs

![image-20230424054009632](assets/image-20230424054009632.png)

## Parameters of Code LLMs

![LLMS](assets/LLMS.png)
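Most Evaluation rows in the model cards below are pass@k scores on benchmarks such as HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples is correct. As a quick reference, here is a minimal NumPy sketch of the unbiased estimator from the Codex paper (the function name and the toy numbers are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k), from the Codex paper.

    n: samples generated per problem
    c: samples that pass the unit tests
    k: sample budget the metric allows
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # Running-product form avoids computing huge binomial coefficients.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy numbers: 200 samples per problem, 17 of them correct.
print(pass_at_k(n=200, c=17, k=1))    # 0.085 (equals c/n when k=1)
print(pass_at_k(n=200, c=17, k=100))  # close to 1.0
```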
## Models

- Codex [OpenAI] [2021.07] [Closed]

  :page_with_curl: [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)
  :black_flag: [introduction](./Model/1-CodeX.md)

  ```yaml
  Model Architecture: Decoder Only, GPT Family
  Params: 12B
  Training Data: Collected [Code: 159GB]
  Training Time: -
  Languages: Python [Multilingual]
  Evaluation: HumanEval, APPS
  Supported Tasks: Code Generation, Docstring Generation
  ```

- Tabnine [Closed]

  :link: [AI assistant for software developers](https://www.tabnine.com/)
  :black_flag: [introduction](./Model/2-Tabnine.md)

  ```yaml
  Model Architecture: LLM
  Params: -
  Training Data: -
  Training Time: -
  Languages: -
  Evaluation: -
  Supported Tasks: Whole-line completions, Full-function completions, Natural-language-to-code completions
  ```

- AlphaCode [DeepMind] [2022.03] [Closed]

  :page_with_curl: [Competition-Level Code Generation with AlphaCode](https://arxiv.org/abs/2203.07814)
  :black_flag: [introduction](./Model/3-AlphaCode(DeepMind).md)

  ```yaml
  Model Architecture: Encoder-Decoder
  Params: 41B
  Training Data: Collected [Code: 715.1GB]
  Training Time: -
  Languages: 12 langs
  Evaluation: HumanEval, APPS, CodeContest
  Supported Tasks: Competition-Level Code Generation
  ```

- PaLM-Coder [Google] [2022.04] [Closed]

  :page_with_curl: [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311)
  :black_flag: [introduction](./Model/4-PaLM-Coder(Google).md)

  ```yaml
  Model Architecture: Decoder Only
  Params: 8B, 62B, 540B
  Training Data: Collected [Text: 741B tokens, Code: 39GB (780B tokens trained)]
  Training Time: 6144 TPU v4 chips
  Languages: Multiple
  Evaluation: HumanEval, MBPP, TransCoder, DeepFix
  Supported Tasks: Code Generation, Code Translation, Code Repair
  ```

- PolyCoder [CMU] [2022.02] [[Open]](https://github.com/VHellendoorn/Code-LMs)

  :page_with_curl: [A Systematic Evaluation of Large Language Models of Code](https://arxiv.org/abs/2202.13169)
  :black_flag: [introduction](./Model/5-PolyCoder(CMU).md)

  ```yaml
  Model Architecture: Decoder Only, GPT Family
  Params: 2.7B
  Training Data: Collected [Code: 253.6GB]
  Training Time: -
  Languages: 12 langs
  Evaluation: HumanEval
  Supported Tasks: Code Generation
  ```

- GPT-Neo [EleutherAI] [2021.03] [[Open]](https://github.com/EleutherAI/gpt-neo)

  :page_with_curl: GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow
  :black_flag: [introduction](./Model/6-GPT-Neo.md)

  ```yaml
  Model Architecture: Decoder Only, GPT Family
  Params: 1.3B, 2.7B
  Training Data: The Pile [Text: 730GB, Code: 96GB (400B tokens trained)]
  Training Time: -
  Languages: Multiple
  Evaluation: HumanEval
  Supported Tasks: Code Generation
  ```

- GPT-NeoX [EleutherAI] [2022.04] [[Open]](https://github.com/EleutherAI/gpt-neox)

  :page_with_curl: [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)
  :black_flag: [introduction](./Model/7-GPT-NeoX.md)

  ```yaml
  Model Architecture: Decoder Only, GPT Family
  Params: 20B
  Training Data: The Pile [Text: 730GB, Code: 95GB (473B tokens trained)]
  Training Time: -
  Languages: Multiple
  Evaluation: HumanEval
  Supported Tasks: Code Generation
  ```

- GPT-J [EleutherAI] [2021.06] [[Open]](https://github.com/kingoflolz/mesh-transformer-jax)

  :link: [GPT-J-6B: 6B JAX-Based Transformer](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/)
  :black_flag: [introduction](./Model/8-GPT-J.md)

  ```yaml
  Model Architecture: Decoder Only, GPT Family
  Params: 6B
  Training Data: The Pile [Text: 730GB, Code: 96GB (473B tokens trained)]
  Training Time: -
  Languages: Multiple
  Evaluation: HumanEval
  Supported Tasks: Code Generation
  ```

- InCoder [Meta] [2022.04] [[Open]](https://sites.google.com/view/incoder-code-models/)

  :page_with_curl: [InCoder: A Generative Model for Code Infilling and Synthesis](https://arxiv.org/abs/2204.05999)
  :black_flag: [introduction](./Model/9-Incoder(Meta).md)

  ```yaml
  Model Architecture: Decoder Only
  Params: 1.3B, 6.7B
  Training Data: Collected [Code: 159GB, StackOverflow: 57GB (60B tokens trained)]
  Training Time: -
  Languages: 28 langs
  Evaluation: HumanEval, MBPP, CodeXGLUE
  Supported Tasks: Infilling Lines of Code (HumanEval), Docstring Generation (CodeXGLUE), Return Type Prediction, Variable Name Prediction
  ```

- CodeGen [Salesforce] [2022.03] [[Open]](https://github.com/salesforce/CodeGen) :star2: popular

  :page_with_curl: [CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474)
  :black_flag: [introduction](./Model/10.11-CodeGen.md)

  ```yaml
  Model Architecture: Decoder Only
  Params: 6.1B, 16.1B
  Training Data: The Pile, BigQuery, BigPython [Code: 150B tokens, Text: 355B tokens]
  Training Time: -
  Languages: CodeGen-Multi (6 langs), CodeGen-Mono (Python)
  Evaluation: HumanEval, MTPB
  Supported Tasks: Single-Turn Code Generation, Multi-Turn Code Generation
  ```

- CodeGeeX [THU] [2022.09] [[Open]](https://github.com/THUDM/CodeGeeX)

  :page_with_curl: [CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X](https://arxiv.org/abs/2303.17568)
  :black_flag: [introduction](./Model/12-CodeGeeX(THU).md)

  ```yaml
  Model Architecture: Decoder Only, GPT Family
  Params: 13B
  Training Data: The Pile, CodeParrot, Collected [Code: 158B tokens (850B tokens trained)]
  Training Time: 1536 Ascend 910 AI processors (32GB) with MindSpore (v1.7.0), two months
  Languages: 23 langs
  Evaluation: HumanEval-X, HumanEval, MBPP, CodeXGLUE, XLCoST
  Supported Tasks: Multilingual Code Generation, Code Translation
  ```

- AiXcoder [PKU] [Closed]

  :link: [AixCoder](https://www.aixcoder.com/#/)
  :black_flag: [introduction](./Model/14-AixCoder.md)

  ```yaml
  Model Architecture: -
  Params: 13B?
  Training Data: -
  Training Time: -
  Languages: Multiple
  Evaluation: -
  Supported Tasks: Code Generation, Code Completion, Code Search
  ```

- PanGu-Coder [Huawei Noah’s Ark Lab] [2022.07] [Closed]

  :page_with_curl: [PanGu-Coder: Program Synthesis with Function-Level Language Modeling](https://arxiv.org/abs/2207.11280)
  :black_flag: [introduction](./Model/15-Pangu-Coder.md)

  ```yaml
  Model Architecture: PanGu-α architecture, Decoder Only
  Params: 2.6B
  Training Data: Collected (147GB)
  Training Time: -
  Languages: Python
  Evaluation: HumanEval, MBPP
  Supported Tasks: Code Generation
  ```

- ERNIE-Code [Baidu] [2022.12] [Closed]

  :warning: ERNIE-Code is arguably not a Code LLM as defined above.

  :page_with_curl: [ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages](https://arxiv.org/abs/2212.06742)
  :black_flag: [introduction](./Model/16-ERNIE-code.md)

  ```yaml
  Model Architecture: Encoder-Decoder, T5-base
  Params: 560M
  Training Data: CodeSearchNet, NL Corpus
  Training Time: -
  Languages: Multiple
  Evaluation: mCoNaLa, Bugs2Fix, Microsoft Docs
  Supported Tasks: Multilingual Code-to-Text, Text-to-Code, Code-to-Code, and Text-to-Text Generation
  ```
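Most of the open checkpoints above (GPT-Neo, GPT-J, GPT-NeoX, InCoder, CodeGen) are also published on the Hugging Face Hub behind the same causal-LM interface. A minimal generation sketch, assuming the transformers library and the small Salesforce/codegen-350M-mono checkpoint (larger variants and the other Hub models follow the same pattern):

```python
# A minimal sketch, assuming transformers (with PyTorch) is installed and
# the Salesforce/codegen-350M-mono checkpoint on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-mono"  # Python-only CodeGen variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding keeps the sketch deterministic; the pass@k numbers in the
# model cards are normally estimated by sampling many completions instead.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```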
## Improve Code LLMs

- AlphaCode: [Competition-Level Code Generation with AlphaCode](https://arxiv.org/abs/2203.07814)
- CodeT: [CodeT: Code Generation with Generated Tests](https://arxiv.org/abs/2207.10397)

## Dataset

- **The Pile** :link: [repo](https://github.com/EleutherAI/the-pile) :black_flag: [introduction](./Dataset/The-Pile.md)
- **BigQuery (BIGQUERY)** :link: [repo](https://cloud.google.com/bigquery/public-data?hl=zh-cn) :black_flag: [introduction](./Dataset/BigQuery.md)
- **CodeParrot** :link: [repo](https://huggingface.co/datasets/codeparrot/github-code) :black_flag: [introduction](./Dataset/CodeParrot.md)
- **The Stack** :link: [repo](https://huggingface.co/datasets/bigcode/the-stack) :black_flag: [introduction](./Dataset/The-Stack.md)
- **PolyCoder** :link: [repo](https://github.com/VHellendoorn/Code-LMs) :black_flag: [introduction](./Dataset/PolyCoder.md)
- **CodeSearchNet** :link: [repo](https://github.com/github/CodeSearchNet) :black_flag: [introduction](./Dataset/CodeSearchNet.md)
- **ProjectCodeNet** :link: [repo](https://github.com/IBM/Project_CodeNet) :black_flag: [introduction](./Dataset/ProjectCodeNet.md)
- **BigPython** :closed_lock_with_key: closed :black_flag: [introduction](./Dataset/BigPython.md)
- **Collected** :spider: crawled data :black_flag: [introduction](./Dataset/Collected.md)
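Several of these corpora live on the Hugging Face Hub and are large enough that streaming access is more practical than a full download. A minimal sketch with the datasets library, assuming The Stack (a gated dataset: accept its terms on the Hub and log in via huggingface-cli before this will download; the data_dir layout follows the dataset card):

```python
# A minimal sketch, assuming the datasets library and access to the gated
# bigcode/the-stack dataset (accept its terms on the Hub and log in first).
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # per-language subdirectories, per the dataset card
    split="train",
    streaming=True,          # iterate without materializing terabytes on disk
)

for i, example in enumerate(ds):
    print(example["content"][:200])  # source text is stored in the "content" field
    if i == 2:
        break
```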
## Benchmark

- **HumanEval** :link: [repo](https://github.com/openai/human-eval) :page_with_curl: [paper](https://arxiv.org/abs/2107.03374) :black_flag: [introduction](./Benchmark/HumanEval.md)
- **APPS** :link: [repo](https://github.com/hendrycks/apps) :page_with_curl: [paper](https://arxiv.org/pdf/2105.09938.pdf) :black_flag: [introduction](./Benchmark/APPS.md)
- **MBPP** :link: [repo](https://github.com/google-research/google-research/tree/master/mbpp) :page_with_curl: [paper](https://arxiv.org/abs/2108.07732) :black_flag: [introduction](./Benchmark/MBPP.md)
- **CodeXGLUE** :link: [repo](https://github.com/microsoft/CodeXGLUE) :page_with_curl: [paper](https://arxiv.org/abs/2102.04664) :black_flag: [introduction](./Benchmark/CodeXGLUE.md)
- **CodeContest** :link: [repo](https://github.com/deepmind/code_contests) :page_with_curl: [paper](https://arxiv.org/abs/2203.07814) :black_flag: [introduction](./Benchmark/CodeContest.md)
- **TransCoder** :link: [repo](https://github.com/facebookresearch/TransCoder) :page_with_curl: [paper](https://arxiv.org/pdf/2006.03511.pdf) :black_flag: [introduction](./Benchmark/TransCoder.md)
- **DeepFix** :link: [repo](https://bitbucket.org/iiscseal/deepfix/src/master/) :page_with_curl: [paper](https://ojs.aaai.org/index.php/AAAI/article/view/10742) :black_flag: [introduction](./Benchmark/DeepFix.md)
- **MTPB** :link: [repo](https://github.com/salesforce/CodeGen/tree/main/benchmark) :page_with_curl: [paper](https://arxiv.org/abs/2203.13474) :black_flag: [introduction](./Benchmark/MTPB.md)
- **HumanEval-X** :link: [repo](https://github.com/THUDM/CodeGeeX/blob/main/codegeex/benchmark/README_zh.md) :page_with_curl: [paper](https://arxiv.org/abs/2303.17568) :black_flag: [introduction](./Benchmark/HumanEval-X.md)
- **XLCoST** :link: [repo](https://github.com/reddy-lab-code-research/XLCoST) :page_with_curl: [paper](https://arxiv.org/pdf/2206.08474.pdf) :black_flag: [introduction](./Benchmark/Xlcost.md)
- **DS-1000** :link: [repo](https://ds1000-code-gen.github.io/) :page_with_curl: [paper](https://arxiv.org/abs/2211.11501) :black_flag: [introduction](./Benchmark/DS-1000.md)
- **ODEX** :link: [repo](https://github.com/zorazrw/odex) :page_with_curl: [paper](https://arxiv.org/pdf/2212.10481.pdf) :black_flag: [introduction](./Benchmark/ODEX.md)

## Future

[Future development](./Other/Future.md)
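A closing usage note for the Benchmark list above: HumanEval ships with its own execution harness in the openai/human-eval repo, and most of the model cards quote scores produced by it. A minimal sketch of the generate-then-score loop, assuming the human-eval package is installed; generate_one_completion is a hypothetical stand-in for a call to any of the models above:

```python
# A minimal sketch, assuming openai/human-eval is installed (pip install -e .).
# generate_one_completion is a hypothetical placeholder for a real model call.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    return "    return 0  # replace with a model-generated function body"

problems = read_problems()  # 164 hand-written Python problems
num_samples_per_task = 1    # raise n to estimate pass@k for k > 1

samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

# Scoring executes untrusted generated code, so the repo keeps it behind an
# explicit sandbox warning. From the shell:
#   evaluate_functional_correctness samples.jsonl
```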