# VulnLLM-R

**Repository Path**: Mr1024/VulnLLM-R

## Basic Information

- **Project Name**: VulnLLM-R
- **License**: MIT
- **Default Branch**: main
- **Created**: 2026-04-07
- **Last Updated**: 2026-04-07

## README

# VulnLLM-R: Specialized Reasoning LLM for Vulnerability Detection

* **Paper:** [arXiv:2512.07533](https://arxiv.org/abs/2512.07533)
* **Code & Data:** [GitHub](https://github.com/ucsb-mlsec/VulnLLM-R)
* **Demo:** [Web demo](https://huggingface.co/spaces/UCSB-SURFI/VulnLLM-R)
* **Model:** [7B Model](https://huggingface.co/UCSB-SURFI/VulnLLM-R-7B)

*Figure: model size vs. F1 scatter plot (`model_size_vs_f1_scatter_01`).*

## Environment and dataset

### 🛠️ Create environment

- Install [Git LFS](https://git-lfs.com/) and clone the repository (LFS files are fetched automatically during clone)

```shell
git lfs install  # one-time setup
git clone https://github.com/ucsb-mlsec/VulnLLM-R.git
cd VulnLLM-R
# If you cloned before installing Git LFS, run:
git lfs pull
```

- Create a new conda environment

```shell
conda create -n vulnscan python=3.11
conda activate vulnscan
```

- Install the required packages

```shell
pip install -e . -e ./vulscan/train/LLaMA-Factory -e ./vulscan/model_zoo
```

## For Reproducing Our Results

```shell
# generate VulnLLM-R-7B's results
python -m vulscan.test.test \
    --output_dir results/test_data \
    --dataset_path ./datasets/test/function_level/ ./datasets/test/repo_level/ \
    --language python c java \
    --model UCSB-SURFI/VulnLLM-R-7B \
    --requests_per_minute 1000 \
    --save --use_cot --batch_size 4 --tp 2 --vllm --max_tokens 8192 --random_cwe

python -m vulscan.test.test_hf \
    --output_dir results/test_hf \
    --hf_dataset UCSB-SURFI/VulnLLM-R-Test-Data \
    --hf_split repo_level function_level \
    --language c python java \
    --model UCSB-SURFI/VulnLLM-R-7B \
    --save --use_cot --vllm --tp 2

# [optional] generate other models' results with our shell script
# remember to add your API keys to .env file if you want to run commercial models
# use ./run_test.sh -h for more options
./vulscan/test/run_test.sh -o results/test_data -t 2 # -o means output directory, -t means tensor parallelism
./vulscan/test/run_test.sh -o results/test_data -M o3-mini # -M means model name, which runs only one model.
./vulscan/test/run_test.sh -o results/test_data -M gpt-5.4 -e high # -e sets reasoning effort (e.g., none/low/medium/high/xhigh)
./vulscan/test/run_test.sh -o results/test_data -M claude-opus-4-6 -e high

# [optional] draw plot to compare with other models
python plots/plot_language_comparison_models.py --results-dir results/test_data
python plots/plot_model_size_scatter.py --results-dir results/test_data
# Note: Labels may overlap with scatter points. Adjust text positions manually if needed.
```

## Existing Distilled Datasets

- [Distill-DeepSeek](https://huggingface.co/datasets/UCSB-SURFI/Distill-DeepSeek)
- [Distill-QwQ](https://huggingface.co/datasets/UCSB-SURFI/Distill-QwQ)

We also provide reduced-reasoning versions of the distilled datasets:

- [Reduced-Distill-DeepSeek](https://huggingface.co/datasets/UCSB-SURFI/Reduced-Distill-DeepSeek)
- [Reduced-Distill-QwQ](https://huggingface.co/datasets/UCSB-SURFI/Reduced-Distill-QwQ)

## Technical Details

### 📚 Construct training and testing datasets

We merge existing function-level vulnerability detection datasets: PrimeVul [1], SecCodePLT [2], Juliet [3], Sven [4], and Arvo [5].
Among these datasets, PrimeVul has the most complex functions.
We create two training sets: clean (without PrimeVul) and noisy (with PrimeVul), so we can train on relatively simple datasets and test on the complex PrimeVul dataset.
Note that calling the training set with PrimeVul "noisy" does not mean the data itself is noisy; it is a fairly arbitrary name we chose at the beginning.

- After downloading all the datasets, use the scripts in ```vulscan/data_process/data_utils``` to process and merge them.
    - ```raw_to_us.py```: Merge the raw data into our dataset and remove redundant data
    - ```check_cwe_correct.py```: Compute the accuracy for each CWE category
    - ```generate_arvo_raw_data.py```: Generate structured raw data from the Arvo dataset
    - ```arvo_to_us.py```: Reformat the structured Arvo raw data into our dataset format
    - ```split_good_bad_for_juliet.py```: Extract data from the raw Juliet 1.3 dataset and convert it into the required format; this forms part of our C clean dataset
    - ```add_sven_to_clean_dataset.py```: Extract data from the Sven dataset, forming part of our C clean dataset
    - ```sync_large_small.py```: Synchronize the modifications of noisy_dataset/large_train/c to noisy_dataset/small_train/c
    - ```remove_testing_from_training.py```: Add the `human` tag to each data point, meaning it has been verified by a human and is used as testing data
    - ```data_utils.py```: Add the related_cwe field to the dataset
- The merged data will be saved in
    - ``datasets/clean_dataset``: the training data without PrimeVul
        - ``datasets/clean_dataset/python`` has the data from SVEN and SecCodePLT
        - ``datasets/clean_dataset/c`` has the data from Juliet and SVEN
    - ``datasets/noisy_dataset``
        - ``datasets/noisy_dataset/small_train``: Contains the training data from PrimeVul and SVEN with selected CWEs (we use the PrimeVul data in this dataset for training)
        - ``datasets/noisy_dataset/large_train``: Contains the training data from PrimeVul, SVEN, and SecCodePLT with more CWEs (this dataset can later be used to train larger models)
        - ``datasets/noisy_dataset/test``: A small testing set from PrimeVul, verified by humans
    - ``datasets/test``
        - ``datasets/test/test_clean``: The testing data from SVEN, SecCodePLT, and Juliet, including OOD CWEs that are not part of the training set
        - ``datasets/test/test_primevul_pair``: The original PrimeVul testing data
- Dataset statistics (run ``vulscan/data_process/data_utils/get_cwe_stat.py`` to get the histogram of the dataset):

| Dataset                       | Language | Train/test | CWE         | # Benign | # Vuln. | Average length |
|-------------------------------|----------|------------|-------------|----------|---------|----------------|
| Clean (seccodeplt)            | Python   | Train      | 20          | 1281     | 1281    | 741            |
| Clean (juliet)                | C/C++    | Train      | 22          | 1716     | 1653    | 3689           |
| Hard (primevul filtered)      | C/C++    | Train      | 26          | 2717     | 2952    | 4689           |
| Long Context (Oss-fuzz)       | C/C++    | Train      | 3           | 475      | 604     | 12761          |
| Simple (seccodeplt)           | Python   | Test       | 24 (6 ood)  | 74       | 74      | 814            |
| Simple (juliet)               | C/C++    | Test       | 38 (14 ood) | 358      | 376     | 2575           |
| Hard (PrimeVul, SecLLMHolmes) | C/C++    | Test       | 13 (5 ood)  | 145      | 152     | 4545           |
| Long Context (Oss-fuzz)       | C/C++    | Test       | 3 (0 ood)   | 0        | 320     | 18929          |
| primevul test (noisy)         | C/C++    | Test       | 56 (34 ood) | 421      | 422     | 5341           |

### 🤔 Generate reasoning data for training

After constructing the datasets, we generate reasoning data for our training set.
We query the DeepSeek-R1 and QwQ reasoning models to generate the reasoning data and filter out samples with very long reasoning chains.
The code for generating reasoning data is in `vulscan/data_process/generate_reasoning`, and the reasoning data will be saved in `datasets/reasoning_data`.

```shell
cd vulscan/data_process/generate_reasoning
```

`generate_reasoning/generate.py` is the main script for generating reasoning data.
For each data point, it generates n reasoning samples and selects the one that has the correct answer and the shortest length.
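The selection rule described above (keep the shortest sample among those with the correct answer) can be sketched as follows. This is a minimal illustration, not the code in `generate.py`: the `ReasoningSample` class, its fields, and `select_shortest_correct` are hypothetical names, and length is measured in characters here, whereas the actual script may count tokens.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReasoningSample:
    """One sampled response (hypothetical structure, not the repo's schema)."""
    reasoning: str  # full reasoning chain text
    predicted: str  # final verdict parsed from the response, e.g. "yes" / "no"


def select_shortest_correct(
    samples: Sequence[ReasoningSample], label: str
) -> Optional[ReasoningSample]:
    """Return the shortest sample whose verdict matches the label, or None."""
    # Keep only samples whose final answer agrees with the ground truth.
    correct = [s for s in samples if s.predicted == label]
    if not correct:
        # No sample got the right answer; the data point yields no reasoning.
        return None
    # Among correct samples, prefer the shortest reasoning chain
    # (character count is an assumption made for this sketch).
    return min(correct, key=lambda s: len(s.reasoning))


if __name__ == "__main__":
    candidates = [
        ReasoningSample("a long chain of thought", "yes"),
        ReasoningSample("short", "no"),
        ReasoningSample("medium chain", "yes"),
    ]
    best = select_shortest_correct(candidates, "yes")
    print(best.reasoning if best else "no correct sample")
```

With `--n 8`, eight such candidates would be compared per data point; data points with no correct candidate produce nothing under this rule.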
Examples of running it with the QwQ and DeepSeek-R1 models (using the Together AI API, which is slower but more stable than the official API) are as follows:

```shell
python generate.py \
    --tp 2 \
    --dataset_type clean_dataset \
    --batch_size 200 \
    --n 8 \
    --training_set train \
    --model_name Qwen/QwQ-32B
# or
python generate.py \
    --dataset_type noisy_dataset \
    --batch_size 200 \
    --n 8 \
    --training_set small_train \
    --model_name together-deepseek-reasoner \
    --together_deepseek
```

After generating the raw reasoning data, we can use another model to summarize it, making it shorter without breaking its structure:

```shell
python extract_reasoning.py
```

We can further filter the reasoning data by length with `generate_reasoning/filter.py`; set `num_processes` according to your number of CPU cores.

```shell
python filter.py \
    --dataset_type noisy_dataset \
    --training_set small_train \
    --model_name Qwen/QwQ-32B \
    --filter_input_length 16000 \
    --filter_all_length 32000 \
    --num_processes 16 \
    --filter_correct_only # for filtering wrong predictions
    # --model_name together-deepseek-reasoner \
```

Finally, we need to reformat the generated reasoning data for the target model that we will train (Qwen-Instruct).

```shell
python reformat_ds.py \
    --dataset_type noisy_dataset \
    --training_set small_train \
    --model_name Qwen/QwQ-32B \
    --filter_input_length 16000 \
    --filter_all_length 32000 \
    --push_to_hub \
    --push_to_hub_organization secmlr \
    --filter_correct_only
```

For the DPO dataset:

```shell
python generate_dpo.py \
    --tp 2 --dataset_type clean_dataset \
    --batch_size 200 --n 8 --training_set train \
    --model secmlr/VD-QWQ-Clean-8k_qwen2_7B_full_sft_1e-5
```

## 🤖 SFT and DPO Training

Refer to [`vulscan/train/README.md`](vulscan/train/README.md) for more details.

Results will be saved in `results/test_qwen/results.json`.
## 🔍 Test the trained models

If you want to reproduce our results, you can run the following commands:

```shell
# open-source model
python -m vulscan.test.test --output_dir results/one_of_4 \
    --dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
    --language c python --model Qwen/Qwen2.5-7B-Instruct \
    --requests_per_minute 100 --save --use_cot \
    --use_policy --batch_size 4 --tp 2 --vllm --max_tokens 16384 \
    --random_cwe

# api model
python -m vulscan.test.test --output_dir results/one_of_4 \
    --dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
    --language c python --model o3-mini-2025-01-14 \
    --requests_per_minute 100 --save --use_cot \
    --use_policy --batch_size 4 --max_tokens 16384 --random_cwe

# local saved model
python -m vulscan.test.test --output_dir results/one_of_4 \
    --dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
    --language c python \
    --model vulscan/train/result/VD-QWQ-Clean-16k/qwen2_7B_full_sft_1e-5 \
    --requests_per_minute 100 --save --use_cot --use_policy \
    --batch_size 4 --tp 2 --vllm --max_tokens 16384 --random_cwe

# our model
python -m vulscan.test.test --output_dir results/one_of_4 \
    --dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
    --language c python \
    --model secmlr/VD-QWQ-Noisy-Small-8k_qwen2_7B_full_sft_1e-5 --revision aa3235b \
    --requests_per_minute 100 --save --use_cot --use_policy \
    --batch_size 4 --tp 2 --vllm --max_tokens 16384 \
    --random_cwe # whether to randomize the order of cwe and related cwes
```

After testing, model responses and performance will be saved in the `results/test_data` directory.
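As a rough illustration of what a detection score over benign/vulnerable labels summarizes, the sketch below computes accuracy, precision, recall, and F1 from paired labels and predictions, treating "vulnerable" as the positive class. It is not the repository's scoring code: `binary_metrics` and its boolean inputs are hypothetical, the standard binary definitions are an assumption about what is reported, and no particular schema for the saved JSON files is assumed.

```python
from typing import Dict, Sequence


def binary_metrics(
    labels: Sequence[bool], predictions: Sequence[bool]
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with True meaning 'vulnerable'."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # Confusion-matrix counts over all (label, prediction) pairs.
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)

    total = tp + fp + fn + tn
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy data: two vulnerable and two benign functions.
    labels = [True, True, False, False]
    predictions = [True, False, True, False]
    print(binary_metrics(labels, predictions))
```

Note that a test split with no benign samples, such as the Long Context test set in the table above, cannot produce false positives, so precision on that split alone is not informative.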
If you want to recompute the performance from the saved model responses, you can run the following command:

```shell
python -m vulscan.test.test_existing_json \
    --json_file results/test_data/datasets_test_test_clean__cot_c_policy_QwQ-32B-Preview.json # the results file
```

```shell
python generate_constitution.py --model gpt-4o --input_dir results/train --output_dir results/train/constitution
```

## Citation

```bibtex
@article{nie2025vulnllmrspecializedreasoningllm,
  title={VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection},
  author={Yuzhou Nie and Hongwei Li and Chengquan Guo and Ruizhe Jiang and Zhun Wang and Bo Li and Dawn Song and Wenbo Guo},
  year={2025},
  journal={arXiv preprint arXiv:2512.07533},
  url={https://arxiv.org/abs/2512.07533},
}
```