# ExeCoder
**Repository Path**: mirrors_microsoft/ExeCoder
## Basic Information
- **Project Name**: ExeCoder
- **Description**: Trains an LLM specifically designed for code translation, using executability representations such as functional semantics, syntax structures, and variable dependencies to enhance the code-translation capabilities of LLMs.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-14
- **Last Updated**: 2026-02-28
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# 🔥ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation
[[Project page]](https://execoder4trans.github.io/) [[Paper]](https://arxiv.org/abs/2501.18460)
Minghua He¹*, Fangkai Yang², Pu Zhao², Wenjie Yin³, Yu Kang², Qingwei Lin², Saravan Rajmohan², Dongmei Zhang², Qi Zhang²
¹Peking University, ²Microsoft, ³KTH Royal Institute of Technology
*Work done during an internship at Microsoft.
## 📝 Project Structure
```
├─checkpoint # Saved models
├─data # IFT data
├─evaluation # Code Translation Evaluation
├─exe_repr_generation
| ├─lang_processors # Programming Language Processors
| ├─parser # Programming Language Parsers
| ├─ast_tools.py # Processing Syntactic-structure Representation
| ├─dataflow_tools.py # Processing Variable-dependency Representation
| ├─deduplication.py # Deduplicate data
| └─XLCoST_preprocess.py # Preprocess XLCoST
├─src # Run SFT
└─tools # JDK for Evaluation
└─TransCoder-test-X.zip # Enhanced Benchmark
```
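The tools under `exe_repr_generation` build syntactic-structure and variable-dependency representations with tree_sitter. As a minimal illustration of the variable-dependency idea only (not the repo's actual `dataflow_tools.py` implementation), the following sketch extracts a def-use map from Python source using the stdlib `ast` module:

```python
# Illustrative sketch of a variable-dependency representation: map each
# assigned variable to the variables its right-hand side reads from.
# The repo itself uses tree_sitter parsers for multiple languages.
import ast

def variable_dependencies(source: str) -> dict:
    """Return {assigned_var: sorted list of vars read in its value}."""
    deps = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            target = node.targets[0]
            if isinstance(target, ast.Name):
                reads = sorted({n.id for n in ast.walk(node.value)
                                if isinstance(n, ast.Name)})
                deps[target.id] = reads
    return deps

code = "a = 1\nb = a + 2\nc = a * b\n"
print(variable_dependencies(code))  # {'a': [], 'b': ['a'], 'c': ['a', 'b']}
```

Such dependency edges, serialized alongside the source, are the kind of executability signal the instruction-tuning data exposes to the model.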
## ⚙️ Environment
**Key Packages:**
```
datasets==2.18.0
fire==0.6.0
gradio==4.39.0
numpy==1.26.4
openai==0.8.0
pandas==2.2.2
torch==2.2.1
tqdm==4.64.1
transformers==4.42.4
tree_sitter==0.21.0
tree_sitter_go==0.21.0
tree_sitter_c_sharp==0.21.0
tree_sitter_java==0.21.0
tree_sitter_javascript==0.21.0
tree_sitter_php==0.22.4
tree_sitter_python==0.21.0
vllm==0.4.1
openpyxl==3.1.5
deepspeed==0.14.2
accelerate==1.0.1
tensorboardX
```
## 📜 Preparation
Follow these steps to fully set up `ExeCoder`:
- **Step 1:** Download [XLCoST](https://github.com/reddy-lab-code-research/XLCoST) and place it under the `data` folder.
- **Step 2:** Download [deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) and place it under the `checkpoint` folder.
- **Step 3:** Download [jdk-10.0.2](#jdk) and place it under the `tools` folder.
- **Step 4:** Install the dependencies listed in [Environment](#Environment).
## 🚀 Quick Start
You can run `ExeCoder` with the following commands:
- Preprocess the XLCoST dataset into XLCoST-Instruct.
```
python exe_repr_generation/XLCoST_preprocess.py
```
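Part of this preprocessing is the deduplication handled by `deduplication.py`. One common approach, shown here only as a hedged sketch (the repo's actual normalization may differ), is to hash a whitespace-normalized form of each snippet so trivially reformatted copies collide:

```python
# Sketch of hash-based deduplication of code snippets. Assumes
# whitespace-only normalization; the repository's criteria may be stricter.
import hashlib

def dedup(snippets):
    seen, unique = set(), []
    for code in snippets:
        # Collapse all whitespace so reformatted duplicates hash identically.
        key = hashlib.sha256(" ".join(code.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(code)
    return unique

print(dedup(["int x = 1;", "int  x = 1;", "int y = 2;"]))
# ['int x = 1;', 'int y = 2;']
```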
- Run instruction tuning to learn executability representations.
```
sh train.sh
```
- Inference.
```
sh inference.sh
```
- Evaluation.
```
sh evaluation.sh
```
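For orientation, an instruction-tuning record for code translation is typically a prompt/response pair. The sketch below uses the common Alpaca-style field names (`instruction`, `input`, `output`) purely as an assumption; the actual schema produced by `XLCoST_preprocess.py` may differ:

```python
# Hedged sketch of an instruction-tuning (IFT) record for code translation.
# Field names follow the widespread Alpaca convention, not necessarily
# the exact format used by this repository.
import json

def make_record(src_lang, tgt_lang, src_code, tgt_code):
    return {
        "instruction": f"Translate the following {src_lang} code to {tgt_lang}.",
        "input": src_code,
        "output": tgt_code,
    }

rec = make_record("C++", "Python",
                  "int add(int a, int b) { return a + b; }",
                  "def add(a, b):\n    return a + b")
print(json.dumps(rec, indent=2))
```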
## 📝 Citation and Reference
If you find this work useful, please consider starring 🌟 this repo and citing 📑 our paper:
```
@misc{he2025execoderempoweringlargelanguage,
title={ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation},
author={Minghua He and Fangkai Yang and Pu Zhao and Wenjie Yin and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang},
year={2025},
eprint={2501.18460},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2501.18460},
}
```