# swallow-code-v2

**Repository Path**: hf-datasets/swallow-code-v2

## Basic Information

- **Project Name**: swallow-code-v2
- **Description**: Mirror of https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-11
- **Last Updated**: 2025-11-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- code
pretty_name: swallowcode2
size_categories:
- 10M
---

### Resources

- 📑 **arXiv**: Read our paper for detailed methodology and results at [arXiv:2505.02881](https://arxiv.org/abs/2505.02881).
- 🤗 **Sister Dataset**: Discover [SwallowMath-v2](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2), our companion dataset for mathematical reasoning.

## 💻 What is it?

[SwallowCode-v1](https://huggingface.co/datasets/tokyotech-llm/swallow-code) was a high-quality Python code dataset generated through an LLM-based rewriting pipeline. However, it had two significant limitations: (1) it was distributed under the **Llama 3.3 Community License**, and (2) its size was limited to **16.1 B** tokens, restricting large-scale pre-training.

To address these issues, we built **SwallowCode-v2**, a fully rewritten Python corpus derived from [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2), using [Qwen3-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). The resulting dataset contains **49.8 billion** tokens and is released under the **Apache 2.0 License**, ensuring both open accessibility and reproducibility for research and commercial use.

As shown in the figure below, SwallowCode-v2 demonstrates stronger performance than other open-source code datasets on downstream code-generation benchmarks.

† Note: While datasets such as [OpenCoder](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta) and [NVIDIA/Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1) are labeled "open," they release only metadata, not the actual training samples. Unlike The-Stack-v2, they cannot be downloaded directly from public storage (e.g., S3) and instead require large-scale re-crawling of GitHub repositories based on the metadata. For smaller open-source LLM projects, this reconstruction is prohibitively expensive, making it impractical to reproduce or directly compare those datasets. Hence, results for those corpora are omitted from our comparison.
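Concretely, the rewriting described above amounts to prompting Qwen3-235B-A22B-Instruct to rewrite each source file (the full pipeline is described under "Dataset curation" below). The snippet that follows is only an illustrative sketch against an OpenAI-compatible inference server; the endpoint URL and prompt wording are placeholders, not the exact setup or prompt used to build SwallowCode-v2.

```python
from openai import OpenAI

# Any OpenAI-compatible server (e.g., a vLLM deployment) hosting
# Qwen3-235B-A22B-Instruct. The base_url, api_key, and prompt wording are
# placeholders for illustration; they are not the SwallowCode-v2 pipeline prompt.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

source_code = "def add(a,b):\n    return a+b\n"

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[{
        "role": "user",
        "content": (
            "Rewrite the following Python code so that it is clear, "
            "well-structured, and self-contained. Return only the rewritten "
            "code.\n\n" + source_code
        ),
    }],
)

print(response.choices[0].message.content)  # the rewritten snippet
```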
## 📊 Dataset Comparison

| Dataset | Token Count (Llama-3 Tokenizer) | License |
| :-------------------------------- | :-----------------------------: | :--------------------------------- |
| **[Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1)** | metadata release | [NVIDIA Open Data License Agreement](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample/raw/main/LICENSE.md) |
| [Stack-Edu](https://huggingface.co/datasets/HuggingFaceTB/stack-edu) (python) | 17.9 B tokens | - |
| **SwallowCode-v1 (our previous)** | 16.1 B tokens | Llama-3.3 Community License |
| **SwallowCode-v2 (this work)** | 49.8 B tokens | **Apache 2.0 License** |

## 📦 What is being released?

**SwallowCode-v2**: A **49.8 B**-token, Apache-2.0-licensed Python code dataset rewritten from The-Stack-v2, designed for scalable LLM pre-training. All samples are auto-formatted, style-normalized, and enhanced for algorithmic clarity via an LLM rewriting pipeline.

## 🧩 Dataset curation

1. **Auto-Formatting** – Standardize code style using the [ruff formatter](https://docs.astral.sh/ruff/).
2. **Length Filtering** – Remove excessively long or truncated samples.
3. **LLM Quality Scoring** – Rate each snippet for readability and style compliance (0–10 scale) using the [SeedCoder](https://arxiv.org/abs/2506.03524) quality-scoring prompt.
4. **LLM Rewriting Phase** – Use Qwen3-235B-A22B-Instruct to rewrite and enhance code for clarity, structure, and algorithmic soundness.
5. **Post-Formatting** – Apply a final ruff pass to ensure uniform formatting and compliance.

### 🗂️ Dataset structure

- **Stage 1** - auto-format: [stage1-auto-format/python](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage1-auto-format/python)
- **Stage 2** - length-filter: [stage2-length-filter/python](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage2-length-filter/python)
- **Stage 3** - llm-score: [stage3-llm-score/python](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage3-llm-score/python)
- **Stage 4** - llm-rewrite: [stage4-llm-rewrite/python/medium](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage4-llm-rewrite/python/medium)
- **Stage 5** - auto-format: [stage5-auto-format/python/medium](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage5-auto-format/python/medium) (**SwallowCode-v2**)
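The final corpus (Stage 5) can be loaded with the 🤗 `datasets` library. The snippet below is a minimal sketch: the `data_dir` value simply mirrors the directory layout listed above, and the `train` split name is an assumption.

```python
from datasets import load_dataset

# Load the final SwallowCode-v2 corpus (Stage 5, Python, medium-quality bucket).
# data_dir mirrors the repository layout above; point it at an earlier stage
# (e.g., "stage3-llm-score/python") to inspect intermediate pipeline outputs.
ds = load_dataset(
    "tokyotech-llm/swallow-code-v2",
    data_dir="stage5-auto-format/python/medium",
    split="train",
)

print(ds)      # dataset size and column names
print(ds[0])   # a single rewritten Python sample
```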
### 🧪 Rewriting ablation experiments

To investigate how different LLM-based rewriting strategies affect the quality of the generated code data, we conducted the following ablation experiments. All experiments involved **50B-token continual pre-training of Llama-3.1-8B**, and performance was tracked with **HumanEval** and **HumanEval+** pass@1 scores over the course of training. By using datasets created with different rewriting strategies as the training corpus, we compared the effectiveness of each method. Insights obtained from these ablations directly informed the construction of **SwallowCode-v2**.

#### Instruct vs Thinking model

We compared the effectiveness of an **Instruct** model and a **Thinking** model (both Qwen3-235B-A22B variants) for rewriting. As shown in the figure below, there was **no significant difference** in performance between data rewritten by the Instruct model and data rewritten by the Thinking model. However, the Thinking model emits a `<think> ... </think>` reasoning trajectory before producing the final rewritten code, leading to a higher GPU cost per rewritten sample. Based on these findings, we adopted the **Instruct model for rewriting**, as it provides comparable quality at a substantially lower computational cost.

#### 1-stage vs 2-stage Rewriting

In [SwallowCode-v1](https://huggingface.co/datasets/tokyotech-llm/swallow-code), we employed a **2-stage rewriting process**. For SwallowCode-v2, we revisited the v1 prompt design to test whether single-stage (1-stage) rewriting could achieve the same quality. Specifically, we combined the two stages of v1 into a single instruction, asking the LLM to perform the same overall rewriting in one step. However, since LLMs are known to ignore parts of overly complex prompts, we could not rule out that explicitly separating the rewriting into two stages was itself beneficial. Therefore, we directly compared 1-stage and 2-stage rewriting. The results showed that **2-stage rewriting required nearly twice the GPU hours** but produced **downstream performance similar to** 1-stage rewriting. Consequently, we adopted the 1-stage rewriting strategy for SwallowCode-v2 construction.

#### High Quality vs Medium Quality

Using the [SeedCoder](https://arxiv.org/abs/2506.03524) quality-scoring prompt, we evaluated and categorized the source code data into **High**, **Medium**, and **Low** quality groups. Intuitively, one might expect higher-quality inputs to yield better rewritten data. However, when we tested this hypothesis on HumanEval and HumanEval+, the results showed the opposite trend: **rewriting from Medium-quality data slightly outperformed rewriting from High-quality data**, as shown below. We hypothesize that this may be due to distributional differences: High-quality code often includes complex, class-based implementations or heavy library use, whereas Medium-quality code tends to resemble the **simpler, problem-oriented** format of HumanEval tasks. This qualitative observation, while informative, remains a preliminary analysis and has not yet been verified through deeper experimentation.

## 📊 Results and Performance

SwallowCode-v2 achieved **+20.7** and **+21.9** higher pass@1 scores on HumanEval and HumanEval+, respectively, compared to [Stack-Edu](https://huggingface.co/datasets/HuggingFaceTB/stack-edu). These experiments were conducted with Llama-3.1-8B.
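For reference, pass@1 here refers to the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). A minimal sketch of the computation follows; the per-problem sample counts are purely illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    is correct, given that c of the n generated samples pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the per-problem fraction of correct samples,
# averaged over all benchmark problems. Counts below are illustrative.
per_problem = [(10, 7), (10, 2), (10, 10)]  # (samples generated, samples correct)
score = sum(pass_at_k(n, c, k=1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {score:.3f}")  # (0.7 + 0.2 + 1.0) / 3 = 0.633
```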
## 📝 Note

The SwallowCode-v2 project was originally designed to build a multilingual code dataset covering 13 programming languages. However, due to the substantial GPU hours and development effort required, and because SwallowCode-v2 and SwallowMath-v2 were both developed by three students in parallel with their main research, completing all subsets proved infeasible. We therefore decided to release the Python subset, which was fully constructed, as SwallowCode-v2.

Future versions (SwallowCode-v3 / SwallowMath-v3) are planned to be larger and higher quality, and may incorporate Thinking-Augmentation and other advanced methodologies. However, the continuation of this project depends on strong demand from the open community or the potential for a clear academic contribution.

## ⚖️ Licensing Information

SwallowCode-v2 is released under the **Apache-2.0 License**. Usage is subject to [The-Stack-v2's licensing terms](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).

## 👥 Contributors

The dataset was primarily developed by the following contributors:

- [Kazuki Fujii](https://www.linkedin.com/in/kazuki-fujii/) – Designed the experiments, implemented the data pipeline, and conducted the experiments.
- [Yukito Tajima](https://www.linkedin.com/in/yukito-tajima-51bbb2299/) – Implemented the data pipeline and optimized the inference pipeline (vLLM, TensorRT-LLM).
- [Masaki Kawamura](https://www.linkedin.com/in/masaki-kawamura-0806a7361/) – Co-designed the experiments, evaluated the models, and performed visualization and analysis.

## 📖 Citation

```
@misc{fujii2025rewritingpretrainingdataboosts,
      title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={2505.02881},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02881},
}
```