# VulnLLM-R
**Repository Path**: Mr1024/VulnLLM-R
## Basic Information
- **Project Name**: VulnLLM-R
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-07
- **Last Updated**: 2026-04-07
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# VulnLLM-R: Specialized Reasoning LLM for Vulnerability Detection
* **Paper:** [arXiv:2512.07533](https://arxiv.org/abs/2512.07533)
* **Code & Data:** [GitHub](https://github.com/ucsb-mlsec/VulnLLM-R)
* **Demo:** [Web demo](https://huggingface.co/spaces/UCSB-SURFI/VulnLLM-R)
* **Model:** [7B Model](https://huggingface.co/UCSB-SURFI/VulnLLM-R-7B)
## Environment and dataset
### 🛠️ Create environment
- Install [Git LFS](https://git-lfs.com/) and clone the repository (LFS files are fetched automatically during clone)
```shell
git lfs install # one-time setup
git clone https://github.com/ucsb-mlsec/VulnLLM-R.git
cd VulnLLM-R
# If you cloned before installing Git LFS, run: git lfs pull
```
- Create a new conda environment
```shell
conda create -n vulnscan python=3.11
conda activate vulnscan
```
- Install the required packages
```shell
pip install -e . -e ./vulscan/train/LLaMA-Factory -e ./vulscan/model_zoo
```
## Reproducing Our Results
```shell
# generate VulnLLM-R-7B's results
python -m vulscan.test.test \
    --output_dir results/test_data \
    --dataset_path ./datasets/test/function_level/ ./datasets/test/repo_level/ \
    --language python c java \
    --model UCSB-SURFI/VulnLLM-R-7B \
    --requests_per_minute 1000 \
    --save --use_cot --batch_size 4 --tp 2 --vllm \
    --max_tokens 8192 --random_cwe
python -m vulscan.test.test_hf \
--output_dir results/test_hf \
--hf_dataset UCSB-SURFI/VulnLLM-R-Test-Data \
--hf_split repo_level function_level \
--language c python java \
--model UCSB-SURFI/VulnLLM-R-7B \
--save --use_cot --vllm --tp 2
# [optional] generate other models' results with our shell script
# remember to add your API keys to .env file if you want to run commercial models
# use ./run_test.sh -h for more options
./vulscan/test/run_test.sh -o results/test_data -t 2 # -o means output directory, -t means tensor parallelism
./vulscan/test/run_test.sh -o results/test_data -M o3-mini # -M means model name, which runs only one model.
./vulscan/test/run_test.sh -o results/test_data -M gpt-5.4 -e high # -e sets reasoning effort (e.g., none/low/medium/high/xhigh)
./vulscan/test/run_test.sh -o results/test_data -M claude-opus-4-6 -e high
# [optional] draw plot to compare with other models
python plots/plot_language_comparison_models.py --results-dir results/test_data
python plots/plot_model_size_scatter.py --results-dir results/test_data # Note: Labels may overlap with scatter points. Adjust text positions manually if needed.
```
## Existing Distilled Datasets
- [Distill-DeepSeek](https://huggingface.co/datasets/UCSB-SURFI/Distill-DeepSeek)
- [Distill-QwQ](https://huggingface.co/datasets/UCSB-SURFI/Distill-QwQ)
We also provide reduced-reasoning versions of the distilled datasets:
- [Reduced-Distill-DeepSeek](https://huggingface.co/datasets/UCSB-SURFI/Reduced-Distill-DeepSeek)
- [Reduced-Distill-QwQ](https://huggingface.co/datasets/UCSB-SURFI/Reduced-Distill-QwQ)
## Technical Details
### 📚 Construct training and testing datasets
We merge existing function-level vulnerability detection datasets:
PrimeVul [1],
SecCodePLT [2],
Juliet [3],
Sven [4],
and Arvo [5].
Among these, PrimeVul has the most complicated functions.
We create two training sets: clean (without PrimeVul) and noisy (with PrimeVul),
so we can train on relatively simple datasets and test on the complex PrimeVul dataset.
Note that calling the training set with PrimeVul "noisy" does not mean the dataset itself is noisy;
it is a somewhat arbitrary name we chose early on.
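The clean/noisy split described above can be sketched as a simple filter over the merged records. The per-record `source` field and record layout below are illustrative assumptions, not the repository's actual schema:

```python
# Sketch of the clean/noisy split: "noisy" simply means "includes PrimeVul".
# The per-record `source` field is an illustrative assumption, not the real schema.

def split_clean_noisy(records):
    """Partition merged records into clean (no PrimeVul) and noisy (all sources)."""
    clean = [r for r in records if r["source"] != "primevul"]
    noisy = list(records)  # noisy keeps every source, PrimeVul included
    return clean, noisy

if __name__ == "__main__":
    merged = [
        {"source": "seccodeplt", "label": 0},
        {"source": "primevul", "label": 1},
        {"source": "juliet", "label": 1},
    ]
    clean, noisy = split_clean_noisy(merged)
    print(len(clean), len(noisy))  # 2 3
```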
- After downloading all the datasets, use the scripts in ```vulscan/data_process/data_utils``` to process and merge them:
  - ```raw_to_us.py```: Merges the raw data into our dataset format and removes redundant data
  - ```check_cwe_correct.py```: Computes the accuracy for each CWE category
  - ```generate_arvo_raw_data.py```: Generates structured raw data from the Arvo dataset
  - ```arvo_to_us.py```: Reformats Arvo structured raw data into our dataset format
  - ```split_good_bad_for_juliet.py```: Extracts data from the raw Juliet 1.3 dataset and converts it into the required format, which forms part of our C clean dataset
  - ```add_sven_to_clean_dataset.py```: Extracts data from the Sven dataset, forming part of our C clean dataset
  - ```sync_large_small.py```: Synchronizes the modifications of noisy_dataset/large_train/c to noisy_dataset/small_train/c
  - ```remove_testing_from_training.py```: Adds the `human` tag to each data point, indicating it has been verified by a human and is used as testing data
  - ```data_utils.py```: Adds the `related_cwe` field to the dataset
- The merged data will be saved in:
  - ``datasets/clean_dataset``: the training data without PrimeVul
    - ``datasets/clean_dataset/python``: data from SVEN and SecCodePLT
    - ``datasets/clean_dataset/c``: data from Juliet and SVEN
  - ``datasets/noisy_dataset``
    - ``datasets/noisy_dataset/small_train``: training data from PrimeVul and SVEN with selected CWEs (we use the PrimeVul data in this dataset for training)
    - ``datasets/noisy_dataset/large_train``: training data from PrimeVul, SVEN, and SecCodePLT with more CWEs (this dataset can later be used to train larger models)
    - ``datasets/noisy_dataset/test``: a small testing set from PrimeVul, verified by humans
  - ``datasets/test``
    - ``datasets/test/test_clean``: testing data from SVEN, SecCodePLT, and Juliet, with OOD CWEs that are not part of the training set
    - ``datasets/test/test_primevul_pair``: the original PrimeVul testing data
- Dataset statistics; run ``vulscan/data_process/data_utils/get_cwe_stat.py`` to get a histogram of the dataset:
| Dataset | Language | Train/test | CWE | # Benign | # Vuln. | average length |
|-------------------------------|----------|------------|-------------|----------|---------|----------------|
| Clean (seccodeplt) | Python | Train | 20 | 1281 | 1281 | 741 |
| Clean (juliet) | C/C++ | Train | 22 | 1716 | 1653 | 3689 |
| Hard (primevul filtered) | C/C++ | Train | 26 | 2717 | 2952 | 4689 |
| Long Context (Oss-fuzz) | C/C++ | Train | 3 | 475 | 604 | 12761 |
| Simple (seccodeplt) | Python | Test | 24 (6 ood) | 74 | 74 | 814 |
| Simple (juliet) | C/C++ | Test | 38 (14 ood) | 358 | 376 | 2575 |
| Hard (PrimeVul, SecLLMHolmes) | C/C++ | Test | 13 (5 ood) | 145 | 152 | 4545 |
| Long Context (Oss-fuzz) | C/C++ | Test | 3 (0 ood) | 0 | 320 | 18929 |
| primevul test (noisy) | C/C++ | Test | 56 (34 ood) | 421 | 422 | 5341 |
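The CWE histogram produced by ``get_cwe_stat.py`` amounts to counting CWE labels per record. This sketch assumes a list-valued `cwe` field, which may differ from the actual dataset layout:

```python
from collections import Counter

def cwe_histogram(records):
    """Count samples per CWE category; assumes an illustrative list-valued `cwe` field."""
    counts = Counter()
    for record in records:
        counts.update(record.get("cwe", []))
    return counts

if __name__ == "__main__":
    data = [
        {"cwe": ["CWE-125"]},
        {"cwe": ["CWE-125", "CWE-787"]},
        {"cwe": ["CWE-787"]},
    ]
    for cwe, n in cwe_histogram(data).most_common():
        print(cwe, n)
```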
### 🤔 Generate reasoning data for training
After constructing the datasets, we generate reasoning data for the training set.
We query the DeepSeek-R1 and QwQ reasoning models to generate reasoning traces and filter out the ones with very long reasoning chains.
The code for generating reasoning data is in `vulscan/data_process/generate_reasoning`, and the generated data is saved in `datasets/reasoning_data`.
```shell
cd vulscan/data_process/generate_reasoning
```
`generate_reasoning/generate.py` is the main script for generating reasoning data.
For each data point, it generates n reasoning samples and selects the one with the correct answer and the shortest length.
Examples of running it with the QwQ and DeepSeek-R1 models (via the Together AI API, which is slower but more stable than the official API):
```shell
python generate.py \
--tp 2 \
--dataset_type clean_dataset \
--batch_size 200 \
--n 8 \
--training_set train \
--model_name Qwen/QwQ-32B
# or
python generate.py \
--dataset_type noisy_dataset \
--batch_size 200 \
--n 8 \
--training_set small_train \
--model_name together-deepseek-reasoner \
--together_deepseek
```
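The selection rule described above (keep the correct sample with the shortest reasoning among the n generations) can be sketched as follows; the sample fields are illustrative assumptions:

```python
def select_best(samples):
    """Among n generated samples, keep a correct answer with the shortest reasoning.

    Each sample is assumed to look like {"reasoning": str, "correct": bool}
    (illustrative fields); returns None when no sample answered correctly.
    """
    correct = [s for s in samples if s["correct"]]
    if not correct:
        return None  # the data point is dropped when all n generations are wrong
    return min(correct, key=lambda s: len(s["reasoning"]))
```

Presumably "correct" here means the generation's final verdict matches the ground-truth label for the data point.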
After generating the raw reasoning data, we can use another model to summarize the traces, making them shorter without breaking their structure:
```shell
python extract_reasoning.py
```
We can further filter the reasoning data by length with `generate_reasoning/filter.py`; set `num_processes` according to your number of CPU cores.
```shell
python filter.py \
--dataset_type noisy_dataset \
--training_set small_train \
--model_name Qwen/QwQ-32B \
--filter_input_length 16000 \
--filter_all_length 32000 \
--num_processes 16 \
  --filter_correct_only # keep only samples with correct predictions
# --model_name together-deepseek-reasoner \
```
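A minimal sketch of the length filter's logic, assuming illustrative `input`/`reasoning`/`correct` fields rather than the script's actual data format:

```python
def keep_sample(sample, max_input_len=16000, max_all_len=32000, correct_only=True):
    """Mirror the filter's thresholds: drop over-long inputs, over-long totals,
    and (optionally) samples with wrong predictions.

    The field names and the length-in-characters measure are illustrative
    assumptions; the actual script may count tokens instead.
    """
    if correct_only and not sample["correct"]:
        return False
    if len(sample["input"]) > max_input_len:
        return False
    return len(sample["input"]) + len(sample["reasoning"]) <= max_all_len
```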
Finally, we will need to reformat the generated reasoning data for the target model that we will train (Qwen-Instruct).
```shell
python reformat_ds.py \
--dataset_type noisy_dataset \
--training_set small_train \
--model_name Qwen/QwQ-32B \
--filter_input_length 16000 \
--filter_all_length 32000 \
--push_to_hub \
--push_to_hub_organization secmlr \
--filter_correct_only
```
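Reformatting for an instruct-tuned target model typically means wrapping each sample as chat messages. This is a minimal sketch assuming a simple user/assistant layout and illustrative field names, not the exact format `reformat_ds.py` emits:

```python
def to_chat_example(sample):
    """Wrap one reasoning sample as a chat-style training example.

    The message layout and field names (`prompt`, `reasoning`, `answer`) are
    illustrative assumptions, not the script's exact output format.
    """
    return {
        "messages": [
            {"role": "user", "content": sample["prompt"]},
            {"role": "assistant", "content": sample["reasoning"] + "\n" + sample["answer"]},
        ]
    }
```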
To generate the DPO dataset:
```shell
python generate_dpo.py \
--tp 2 --dataset_type clean_dataset \
--batch_size 200 --n 8 --training_set train \
--model secmlr/VD-QWQ-Clean-8k_qwen2_7B_full_sft_1e-5
```
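DPO training needs chosen/rejected response pairs. A minimal sketch of how such pairs could be built from the n generations per prompt (the `prompt`/`response`/`correct` fields are illustrative assumptions):

```python
from collections import defaultdict

def build_dpo_pairs(generations):
    """Build chosen/rejected pairs per prompt for DPO.

    Field names are illustrative assumptions. Prompts with only correct or
    only incorrect generations yield no pair.
    """
    by_prompt = defaultdict(list)
    for gen in generations:
        by_prompt[gen["prompt"]].append(gen)
    pairs = []
    for prompt, gens in by_prompt.items():
        chosen = [g for g in gens if g["correct"]]
        rejected = [g for g in gens if not g["correct"]]
        if chosen and rejected:
            pairs.append({
                "prompt": prompt,
                # prefer the shortest correct response, mirroring the SFT selection rule
                "chosen": min(chosen, key=lambda g: len(g["response"]))["response"],
                "rejected": rejected[0]["response"],
            })
    return pairs
```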
## 🤖 SFT and DPO Training
Refer to [`vulscan/train/README.md`](vulscan/train/README.md) for more details.
Results will be saved in `results/test_qwen/results.json`.
## 🔍 Test the trained models
To reproduce our results, run one of the following commands:
```shell
# open-source model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python --model Qwen/Qwen2.5-7B-Instruct \
--requests_per_minute 100 --save --use_cot \
--use_policy --batch_size 4 --tp 2 --vllm --max_tokens 16384 \
--random_cwe
# api model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python --model o3-mini-2025-01-14 \
--requests_per_minute 100 --save --use_cot \
--use_policy --batch_size 4 --max_tokens 16384 --random_cwe
# local saved model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python \
--model vulscan/train/result/VD-QWQ-Clean-16k/qwen2_7B_full_sft_1e-5 \
--requests_per_minute 100 --save --use_cot --use_policy \
--batch_size 4 --tp 2 --vllm --max_tokens 16384 --random_cwe
# our model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python \
--model secmlr/VD-QWQ-Noisy-Small-8k_qwen2_7B_full_sft_1e-5 --revision aa3235b \
--requests_per_minute 100 --save --use_cot --use_policy \
--batch_size 4 --tp 2 --vllm --max_tokens 16384 \
--random_cwe # whether to randomize the order of cwe and related cwes
```
After testing, model responses and performance will be saved in the `results/test_data` directory.
To recompute performance from the saved model responses, run:
```shell
python -m vulscan.test.test_existing_json \
--json_file results/test_data/datasets_test_test_clean__cot_c_policy_QwQ-32B-Preview.json # the results file
```
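As a sanity check, the headline metrics can be recomputed from the saved predictions along these lines; the `label`/`pred` field names are assumptions about the results layout:

```python
def binary_metrics(records):
    """Accuracy and F1 over saved predictions.

    The `label`/`pred` fields (1 = vulnerable, 0 = benign) are assumptions
    about the saved results layout, for illustration only.
    """
    tp = sum(1 for r in records if r["pred"] == 1 and r["label"] == 1)
    fp = sum(1 for r in records if r["pred"] == 1 and r["label"] == 0)
    fn = sum(1 for r in records if r["pred"] == 0 and r["label"] == 1)
    accuracy = sum(1 for r in records if r["pred"] == r["label"]) / len(records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```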
There is also a script for generating a constitution from training-set results:
```shell
python generate_constitution.py --model gpt-4o --input_dir results/train --output_dir results/train/constitution
```
## Citation
```bibtex
@article{nie2025vulnllmrspecializedreasoningllm,
title={VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection},
author={Yuzhou Nie and Hongwei Li and Chengquan Guo and Ruizhe Jiang and Zhun Wang and Bo Li and Dawn Song and Wenbo Guo},
year={2025},
journal={arXiv preprint arXiv:2512.07533},
url={https://arxiv.org/abs/2512.07533},
}
```