# CUDA-L1
**Repository Path**: lerenhua/CUDA-L1
## Basic Information
- **Project Name**: CUDA-L1
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-21
- **Last Updated**: 2025-08-21
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## 🥳 Introduction
In this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization. The core of CUDA-L1 is a contrastive RL model, a newly designed RL system that improves optimization through comparative learning.
Figure: Average speedup across different architectures on KernelBench over baselines.
## 🗒️ To-do List
- [x] Fix KernelBench evaluations with proper stream timing synchronization ✅
- [x] Remove caching ✅
- [x] Compare with torch.compile ✅
- [x] Compare with pytorch eager + cuda graph ✅
- [x] Compare with custom torch CUDA/cuDNN backend flags ✅
- [ ] 5090/4090
## 🩺 Evaluation Results
Our evaluation is conducted on the KernelBench [dataset](https://github.com/ScalingIntelligence/KernelBench), a collection of 250 PyTorch workloads designed to evaluate language models' ability to generate efficient GPU kernels.
**Table: Performance comparison across different configurations on KernelBench on A100.**
| Configuration | Method | Mean | Max | 75% | 50% | 25% | Success↑ (# out of total) | Speedup↑ (>1.01× out of total) |
|---------------|--------|------|-----|-----|-----|-----|---------------------------|--------------------------------|
| Default | All | 3.12× | 120× | 2.25× | 1.42× | 1.17× | 249/250 | 226/250 |
| Torch Compile | All | 2.77× | 69.0× | 2.55× | 1.72× | 1.14× | 249/250 | 203/250 |
| Torch Compile RO | All | 2.88× | 80.1× | 2.48× | 1.67× | 1.13× | 249/250 | 200/250 |
| CUDA Graph | All | 2.81× | 97.9× | 1.83× | 1.20× | 0.954× | 249/250 | 147/229 |
- RO = Reduce Overhead
- Success and Speedup indicate the number of successful benchmarks out of the total for each level
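The columns of the table above can be recomputed from a list of per-task speedup ratios. The following is a minimal sketch, assuming an arithmetic mean and inclusive quartiles; the exact aggregation behind the reported numbers is not stated here, and `summarize_speedups` and the sample values are hypothetical.

```python
import statistics


def summarize_speedups(speedups, threshold=1.01):
    """Summarize per-task speedup ratios (hypothetical helper, not from the release)."""
    ordered = sorted(speedups)
    # With n=4, statistics.quantiles returns the 25%, 50%, and 75% cut points.
    q25, q50, q75 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "p75": q75,
        "p50": q50,
        "p25": q25,
        # Tasks counted as sped up: ratio strictly above the threshold.
        "faster": sum(1 for s in ordered if s > threshold),
        "total": len(ordered),
    }


# Made-up ratios for illustration only; these are not measured results.
example = [0.95, 1.00, 1.20, 1.50, 3.00]
summary = summarize_speedups(example)
print(summary)
```

A single large outlier can dominate the mean, which is one reason the table also reports quartiles.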
**Table: Mean speedup across different configurations and GPU devices.**
| Configuration | A100 | 3090 | H100 | H20 | L40 |
|---------------|------|------|------|-----|-----|
| Default | 3.12× | 2.51× | 3.85× | 2.38× | 3.13× |
| Torch Compile | 2.77× | 2.58× | 2.74× | 2.89× | 2.85× |
| Torch Compile RO | 2.88× | 2.61× | 2.77× | 2.82× | 2.89× |
| CUDA Graph | 2.81× | 3.34× | 2.23× | 2.20× | 3.98× |
## ❓ How to reproduce the results?
We provide CUDA code snippets optimized by CUDA-L1 in the `optimized_cuda_code` folder, with separate versions for each GPU device. For example, to reproduce our results on H100 SXM, download `./optimized_cuda_code/h100.json` and run each code snippet on your H100 device.
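When timing the snippets yourself, asynchronous GPU work must be synchronized before the clock is read, otherwise the measured time only covers kernel launches. The sketch below is a generic wall-clock harness and not the evaluation code used for the reported numbers; `time_callable`, `speedup`, and the `synchronize` parameter are hypothetical names. It uses only the standard library so that it runs anywhere.

```python
import time


def time_callable(fn, warmup=3, iters=20, synchronize=lambda: None):
    """Return the mean wall-clock time of fn() in seconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    synchronize()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    synchronize()  # wait for queued work to finish before stopping the clock
    return (time.perf_counter() - start) / iters


def speedup(ref_fn, custom_fn, **kwargs):
    """Execution time ratio ref / custom, so values above 1 mean custom is faster."""
    return time_callable(ref_fn, **kwargs) / time_callable(custom_fn, **kwargs)


# CPU-only stand-ins for a reference and an optimized workload.
ratio = speedup(lambda: time.sleep(0.02), lambda: time.sleep(0.002), warmup=0, iters=3)
print(f"speedup: {ratio:.2f}x")
```

On a GPU, one would pass `synchronize=torch.cuda.synchronize` so that both clock reads happen after all queued kernels have completed.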
## 📁 Structure of Release Code
Each line in the release file contains a JSON object with the following fields:
| Field | Description |
|-------|-------------|
| `level_id` | Level index in KernelBench (values: 1, 2, 3) |
| `task_id` | Task index for that level |
| `ref_code` | Reference PyTorch code provided by KernelBench |
| `custom_code` | Optimized code generated by CUDA-L1 |
| `cuda_graph_code` | KernelBench reference code with CUDA Graph modifications |
| `score_default` | Execution time ratio: ref_code / custom_code |
| `score_torch_compile_default` | Execution time ratio: ref_code / custom_code (with torch.compile) |
| `score_torch_compile_reduce_overhead` | Execution time ratio: ref_code / custom_code (with torch.compile reduce_overhead mode) |
| `score_cuda_graph` | Execution time ratio: cuda_graph_code / custom_code |
**Note:** If `custom_code` is None, the RL model either failed to generate code faster than the reference code or simply copied the reference code during generation.
### Example Entry Structure
```json
{
"level_id": 1,
"task_id": 1,
"ref_code": "import torch...",
"custom_code": "import torch...",
"cuda_graph_code": "import torch...",
"score_default": 1.762,
"score_torch_compile_default": 1.958,
"score_torch_compile_reduce_overhead": 2.118,
"score_cuda_graph": 1.566
}
```
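Because each line of a release file is one JSON object, the file can be read line by line. The following is a minimal sketch, assuming that a missing optimized kernel is stored as JSON `null`; `load_release` and `optimized_entries` are hypothetical helpers, and the in-memory sample uses made-up values.

```python
import io
import json


def load_release(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def optimized_entries(records):
    """Keep only records that carry an optimized kernel (custom_code is not None)."""
    return [r for r in records if r.get("custom_code") is not None]


# In-memory stand-in for a release file; field values are illustrative only.
sample = io.StringIO(
    '{"level_id": 1, "task_id": 1, "custom_code": "import torch...", "score_default": 1.762}\n'
    '{"level_id": 1, "task_id": 2, "custom_code": null}\n'
)
records = load_release(sample)
kept = optimized_entries(records)
print(len(records), len(kept))
```

For a real file, the same functions apply to an open file handle, for example `with open("optimized_cuda_code/h100.json") as f: records = load_release(f)`.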
## 🔭 Limitations and Challenges
During the training process, we found that RL is particularly susceptible to reward hacking. We've already identified quite a few hacking cases (e.g., exploiting timing measurements and caching results). If you identify any additional reward hacks in the code, we would greatly appreciate you letting us know.
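One simple way to catch the result-caching kind of hack is to compare a candidate against the reference on freshly generated inputs, since a cached output stops matching as soon as the input changes. The sketch below is a toy illustration on scalar functions and not the check used in CUDA-L1; `agrees_on_fresh_inputs`, `reference`, and `cached_cheat` are hypothetical.

```python
import random


def agrees_on_fresh_inputs(candidate, reference, make_input, trials=5, tol=1e-6):
    """Return True if candidate matches reference on every freshly generated input."""
    for _ in range(trials):
        x = make_input()  # a new input per trial defeats output caching
        if abs(candidate(x) - reference(x)) > tol:
            return False
    return True


def reference(x):
    return 2.0 * x


_cache = {}


def cached_cheat(x):
    # Toy "hack": returns the first result it ever computed, whatever the input.
    if "y" not in _cache:
        _cache["y"] = 2.0 * x
    return _cache["y"]


rng = random.Random(0)  # fixed seed so the illustration is repeatable


def make_input():
    return rng.uniform(1.0, 100.0)


honest_ok = agrees_on_fresh_inputs(reference, reference, make_input)
cheat_ok = agrees_on_fresh_inputs(cached_cheat, reference, make_input)
print(honest_ok, cheat_ok)
```

A check of this kind only covers caching; timing-related hacks need separate safeguards such as synchronizing before each clock read.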
## 📇 Citation
```bibtex
@article{deepreinforce2025cudal1,
title={CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning},
author={Li, Xiaoya and Sun, Xiaofei and Wang, Albert and Li, Jiwei and Chris, Shum},
journal={arXiv preprint arXiv:2507.14111},
year={2025}
}
```
## ✉️ Contact
If you have any questions, please reach out to us at **research@deep-reinforce.com**.