# CUDA-L1

**License**: MIT

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning


## 🥳 Introduction

In this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization. The core of CUDA-L1 is a contrastive RL model, a newly designed RL system to enhance optimization through comparative learning.

Fig: Average speedup over baselines on KernelBench across different GPU architectures.

## 🗒️ To-do List

- [x] Fix KernelBench evaluations with proper stream timing synchronization ✅
- [x] Remove caching ✅
- [x] Compare with torch.compile ✅
- [x] Compare with PyTorch eager + CUDA Graph ✅
- [x] Compare with custom torch CUDA/cuDNN backend flags ✅
- [ ] 5090/4090

## 🩺 Evaluation Results

Our evaluation is conducted on the KernelBench [dataset](https://github.com/ScalingIntelligence/KernelBench), a collection of 250 PyTorch workloads designed to evaluate language models' ability to generate efficient GPU kernels.

**Table: Performance comparison across different configurations on KernelBench on A100.**

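| Configuration | Method | Mean | Max | 75% | 50% | 25% | Success ↑ (# out of total) | Speedup ↑ (>1.01× out of total) |
|---------------|--------|------|-----|-----|-----|-----|---------------------------|---------------------------------|
| Default | All | 3.12× | 120× | 2.25× | 1.42× | 1.17× | 249/250 | 226/250 |
| Torch Compile | All | 2.77× | 69.0× | 2.55× | 1.72× | 1.14× | 249/250 | 203/250 |
| Torch Compile RO | All | 2.88× | 80.1× | 2.48× | 1.67× | 1.13× | 249/250 | 200/250 |
| CUDA Graph | All | 2.81× | 97.9× | 1.83× | 1.20× | 0.954× | 249/250 | 147/229 |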

- RO = Reduce Overhead
- Success is the number of benchmarks that ran successfully, and Speedup is the number with a speedup above 1.01×, each out of the total for each level
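The summary columns in the table above can be illustrated with a minimal sketch. It assumes an arithmetic mean and inclusive linear-interpolation quartiles; the README does not state which percentile method or averaging scheme was actually used, so treat both as assumptions. The function name `summarize_speedups` and the sample values are hypothetical and are not taken from the release data.

```python
import statistics


def summarize_speedups(speedups, threshold=1.01):
    """Summarize per-task speedups in the style of the table columns.

    Assumptions (not confirmed by the README): arithmetic mean,
    and quartiles computed with the "inclusive" method.
    """
    q25, q50, q75 = statistics.quantiles(speedups, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(speedups),
        "max": max(speedups),
        "75%": q75,
        "50%": q50,
        "25%": q25,
        # Number of tasks whose speedup exceeds the 1.01x threshold.
        "speedup_count": sum(1 for s in speedups if s > threshold),
        "total": len(speedups),
    }


# Made-up speedups for five hypothetical tasks (not real results).
example_speedups = [0.9, 1.0, 1.5, 2.0, 4.0]
summary = summarize_speedups(example_speedups)
print(summary)
```

With this convention, a task counts toward the Speedup column only when it is strictly faster than 1.01× the baseline, so near-ties are not counted as wins.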

**Table: Mean speedup across different configurations and GPU devices.**

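| Configuration | A100 | 3090 | H100 | H20 | L40 |
|---------------|------|------|------|-----|-----|
| Default | 3.12× | 2.51× | 3.85× | 2.38× | 3.13× |
| Torch Compile | 2.77× | 2.58× | 2.74× | 2.89× | 2.85× |
| Torch Compile RO | 2.88× | 2.61× | 2.77× | 2.82× | 2.89× |
| CUDA Graph | 2.81× | 3.34× | 2.23× | 2.20× | 3.98× |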

## ❓ How to reproduce the results?

We provide CUDA code snippets optimized by CUDA-L1 in the `optimized_cuda_code` folder, with separate versions for each GPU device. For example, to reproduce our results on H100 SXM, download `./optimized_cuda_code/h100.json` and run each code snippet on your H100 device.

## 📁 Structure of Release Code

Each line in the release file contains a JSON object with the following fields:

| Field | Description |
|-------|-------------|
| `level_id` | Level index in KernelBench (values: 1, 2, 3) |
| `task_id` | Task index for that level |
| `ref_code` | Reference CUDA code provided by KernelBench |
| `custom_code` | Optimized code generated by CUDA-L1 |
| `cuda_graph_code` | KernelBench reference code with CUDA Graph modifications |
| `score_default` | Execution time ratio: ref_code / custom_code |
| `score_torch_compile_default` | Execution time ratio: ref_code / custom_code (with torch.compile) |
| `score_torch_compile_reduce_overhead` | Execution time ratio: ref_code / custom_code (with torch.compile reduce_overhead mode) |
| `score_cuda_graph` | Execution time ratio: cuda_graph_code / custom_code |

**Note:** If `custom_code` is None, the RL model either failed to generate code faster than the reference code or simply copied the reference code during generation.

### Example Entry Structure

```json
{
  "level_id": 1,
  "task_id": 1,
  "ref_code": "import torch...",
  "custom_code": "import torch...",
  "cuda_graph_code": "import torch...",
  "score_default": 1.762,
  "score_torch_compile_default": 1.958,
  "score_torch_compile_reduce_overhead": 2.118,
  "score_cuda_graph": 1.566
}
```

## 🔭 Limitations and Challenges

During the training process, we found that RL is particularly susceptible to reward hacking. We have already identified quite a few hacking cases (e.g., exploiting timing measurements and caching results). If you identify any additional reward hacks in the code, we would greatly appreciate you letting us know.
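When inspecting the release files or checking an entry for possible reward hacks, it can help to load them programmatically. The following is a minimal sketch that assumes the one-JSON-object-per-line layout described under "Structure of Release Code". The helper names (`load_entries`, `optimized_entries`, `mean_score`) and the sample records are hypothetical. How score fields are filled when `custom_code` is None is not documented, so the sketch simply skips non-numeric scores.

```python
import json


def load_entries(lines):
    """Parse an iterable of JSON-object lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def optimized_entries(entries):
    """Keep entries where CUDA-L1 produced code (custom_code is not None)."""
    return [e for e in entries if e.get("custom_code") is not None]


def mean_score(entries, field):
    """Arithmetic mean of one score field, ignoring missing or null values."""
    values = [e[field] for e in entries if isinstance(e.get(field), (int, float))]
    return sum(values) / len(values) if values else None


# Made-up records in the documented shape (not real results).
sample_lines = [
    json.dumps({"level_id": 1, "task_id": 1, "ref_code": "import torch...",
                "custom_code": "import torch...", "score_default": 1.5}),
    json.dumps({"level_id": 1, "task_id": 2, "ref_code": "import torch...",
                "custom_code": None, "score_default": None}),
    json.dumps({"level_id": 2, "task_id": 1, "ref_code": "import torch...",
                "custom_code": "import torch...", "score_default": 2.5}),
]

entries = load_entries(sample_lines)
kept = optimized_entries(entries)
print(len(entries), len(kept), mean_score(kept, "score_default"))

# To read a real release file instead, pass the open file object:
#   with open("optimized_cuda_code/h100.json") as f:
#       entries = load_entries(f)
```

Reading line by line rather than calling `json.load` on the whole file matches the stated layout, in which each line is an independent JSON object.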
## 📇 Citation

```latex
@article{deepreinforce2025cudal1,
  title={CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning},
  author={Li, Xiaoya and Sun, Xiaofei and Wang, Albert and Li, Jiwei and Chris, Shum},
  journal={arXiv preprint arXiv:2507.14111},
  year={2025}
}
```

## ✉️ Contact

If you have any questions, please reach out to us at **research@deep-reinforce.com**.