# Impromptu-VLA
**Repository Path**: flashdxy/Impromptu-VLA
## Basic Information
- **Project Name**: Impromptu-VLA
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: CC-BY-SA-4.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-04
- **Last Updated**: 2025-06-06
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Impromptu-VLA
This repository contains the code for the following work:
> Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
## [Project Page](http://Impromptu-VLA.c7w.tech/)
Haohan Chi*,¹, [Huan-ang Gao*,¹](https://c7w.tech/), Ziming Liu†,², Jianing Liu¹, Chenyu Liu¹, Jinwei Li¹, Kaisen Yang¹, Yangcheng Yu¹, Zeda Wang¹, Wenyi Li¹, Leichen Wang², Xingtao Hu², Hao Sun², [Hang Zhao³](https://hangzhaomit.github.io/), [Hao Zhao¹,†](https://sites.google.com/view/fromandto/)
¹AIR, Tsinghua University, ²Bosch Research, ³IIIS, Tsinghua University, *Equal contribution, †Corresponding author
## Introductory Video
Our dataset can be accessed on [Hugging Face](https://huggingface.co/datasets/aaaaaap/unstructed).
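For convenience, the dataset can also be fetched with the Hugging Face CLI; a minimal sketch (the `--local-dir` target below is only a suggestion, place the files wherever your `data_raw` layout expects them):

```bash
# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
# The --local-dir below is a suggested location, not a path mandated by this repo.
huggingface-cli download aaaaaap/unstructed --repo-type dataset --local-dir data_raw/impromptu
```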
If you want to create our benchmark QA data from scratch:
1. First, download and organize the raw data following the layout described in `data_raw`.
2. Parse the data using the code and instructions in that folder (required for the `waymo` and `mapillary_sls` datasets).
3. Enter the main directory and create a symbolic link for `navsim`:
```bash
ln -s /data_raw/navsim /data_qa_generate/data_engine/data_storage/external_datasets/navsim
```
4. After the data is successfully organized, run the following script:
```bash
bash scripts/data_qa_generate.sh
```
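Before running the script, it can help to sanity-check the layout. Below is a sketch, assuming the three datasets mentioned above sit directly under `data_raw/` (the exact structure is defined by the instructions in `data_raw`):

```bash
# Assumed layout (illustrative only; follow the data_raw instructions for the exact structure):
#   data_raw/
#   ├── waymo/
#   ├── mapillary_sls/
#   └── navsim/
#
# Verify that the navsim symlink from step 3 resolves to the raw data:
ls -l /data_qa_generate/data_engine/data_storage/external_datasets/navsim
```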
---
### ✨ Environment Configuration
We leverage some powerful open-source libraries to make this project shine. To ensure a smooth experience, please configure your environment by referring to their official documentation.
Here are the key players:
* **sglang**: Your go-to for efficient large language model serving. Check out their setup guide here: [sglang](https://github.com/sgl-project/sglang) ✨
* **LLaMA-Factory**: A comprehensive and user-friendly framework for fine-tuning large language models. Dive into their documentation for installation details: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) 🛠️
* **vLLM**: For high-throughput and low-latency inference. Find out how to get it running here: [vllm](https://github.com/vllm-project/vllm) ⚡
**Pro Tip:** We highly recommend creating a dedicated virtual environment (using tools like `conda` or `venv`) to manage the dependencies for this project. This helps keep your workspace clean and avoids conflicts with other Python projects. Happy configuring! 👩‍💻
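As a rough starting point, a minimal environment sketch is shown below; the package names and extras follow each project's install docs at the time of writing, and those docs remain the authoritative reference:

```bash
# Minimal sketch -- adjust the Python version and CUDA wheels to your system.
conda create -n impromptu-vla python=3.10 -y
conda activate impromptu-vla

# Serving / inference backends. Note: sglang and vLLM may pin different torch builds;
# if you hit dependency conflicts, install only the backend you plan to use (or use separate envs).
pip install "sglang[all]"
pip install vllm

# Fine-tuning framework (editable install from source, per the LLaMA-Factory README).
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..
```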
### 📊 Results
Open-loop trajectory prediction L2 errors (m) on the nuScenes dataset.
| Method | 1s | 2s | 3s | Avg. |
| --- | --- | --- | --- | --- |
| **Closed-source API-only Models** | | | | |
| GPT-4o¹ | 0.28 | 0.93 | 2.02 | 1.07 |
| Claude-3.5-Sonnet¹ | 0.29 | 0.98 | 2.12 | 1.13 |
| Claude-3.7-Sonnet¹ | 0.28 | 0.94 | 2.04 | 1.09 |
| Gemini-2.0-Flash¹ | 0.31 | 1.08 | 2.36 | 1.25 |
| Gemini-2.5-Pro¹ | 0.37 | 1.35 | 2.96 | 1.56 |
| **Open-source Generalist VLMs** | | | | |
| LLaVA-1.6-Mistral-7B² | 1.49 | 3.38 | 4.09 | 2.98 |
| Llama-3.2-11B-Vision-Instruct² | 1.54 | 3.31 | 3.91 | 2.92 |
| Qwen2-VL-7B-Instruct² | 1.45 | 3.21 | 3.76 | 2.81 |
| DeepSeek-VL2-16B¹ | 0.66 | 1.68 | 2.92 | 1.75 |
| DeepSeek-VL2-28B¹ | 0.37 | 1.35 | 2.96 | 1.56 |
| LLaMA-3.2-11B-Vision-Instruct¹ | 0.52 | 1.42 | 2.68 | 1.54 |
| LLaMA-3.2-90B-Vision-Instruct¹ | 0.66 | 1.71 | 3.01 | 1.79 |
| Qwen-2.5-VL-7B-Instruct¹ | 0.46 | 1.33 | 2.55 | 1.45 |
| **Training-based Driving Specialists (Existing Methods)** | | | | |
| UniAD³ | 0.42 | 0.64 | 0.91 | 0.66 |
| VAD³ | 0.17 | 0.34 | 0.60 | 0.37 |
| BEV-Planner³ | 0.16 | 0.32 | 0.57 | 0.35 |
| Ego-MLP³\* | 0.15 | 0.32 | 0.59 | 0.35 |
| **Ours and Key Competitors (Specialized Driving Models)** | | | | |
| DriveVLM³ | 0.18 | 0.34 | 0.68 | 0.40 |
| OmniDrive³ | 0.14 | 0.29 | 0.55 | 0.33 |
| DriveVLM-Dual³ | 0.15 | 0.29 | 0.48 | 0.31 |
| EMMA (random init)³ | 0.15 | 0.33 | 0.63 | 0.37 |
| EMMA³ | 0.14 | 0.29 | 0.54 | 0.32 |
| EMMA+³ | 0.13 | 0.27 | 0.48 | 0.29 |
| 3B Base+nuScenes | 0.14 | 0.30 | 0.58 | 0.34 |
| 3B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.52 | 0.30 |
| 7B Base+nuScenes | 0.13 | 0.28 | 0.55 | 0.32 |
| 7B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.53 | 0.30 |

Note: Best results within each category are in bold, second best are underlined.
¹ Results from LightEMMA; ² from OpenEMMA; ³ from EMMA.
Results on NeuroNCAP. The first four numeric columns report the NeuroNCAP score (↑ higher is better); the last four report the collision rate in % (↓ lower is better). "Stat." abbreviates the stationary scenario.

| Source | Method | Score (Avg.) | Score (Stat.) | Score (Frontal) | Score (Side) | Collision (Avg.) | Collision (Stat.) | Collision (Frontal) | Collision (Side) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CVPR 2023 | UniAD² | 0.73 | 0.84 | 0.10 | 1.26 | 88.6 | 87.8 | 98.4 | 79.6 |
| ICCV 2023 | VAD² | 0.66 | 0.47 | 0.04 | 1.45 | 92.5 | 96.2 | 99.6 | 81.6 |
| ICRA 2025 | SparseDrive¹ | 0.92 | - | - | - | 93.9 | - | - | - |
| CVPR 2025 | BridgeAD-S¹ | 1.52 | - | - | - | 76.2 | - | - | - |
| CVPR 2025 | BridgeAD-B¹ | 1.60 | - | - | - | 72.6 | - | - | - |
| - | Base+nuScenes | 1.77 | 1.80 | 1.67 | 1.75 | 72.5 | 68.0 | 73.0 | 71.5 |
| - | Base+Impromptu+nuScenes | 2.15 | 1.77 | 2.31 | 2.10 | 65.5 | 70.0 | 59.0 | 65.0 |

Note: Best scores in each category are in bold, second best are underlined.
¹ Results from BridgeAD; ² from NeuRAD.
The improvement in the overall NeuroNCAP score and, crucially, the reduction in collision rates suggest that our dataset helps the model develop a more nuanced understanding of complex road interactions, leading to more robust and safer driving policies.
### 📥 Download Pre-trained Models
Download links for the pre-trained models:

| Method | Download |
| --- | --- |
| 3B Base+nuScenes | HF Hub |
| 3B Base+Impromptu | HF Hub |
| 3B Base+Impromptu+nuScenes | HF Hub |
| 7B Base+nuScenes | HF Hub |
| 7B Base+Impromptu | HF Hub |
| 7B Base+Impromptu+nuScenes | HF Hub |
### 🚀 Model Training
To start training, simply run the following command:
```bash
llamafactory-cli train <config_path>
```
Replace `<config_path>` with the path to your training configuration file. For example:
```bash
llamafactory-cli train train/Qwen2_5-VL/QA_train_sub_fin_nu/3B_full_QA_train_bs8.yaml
```
This command will launch the training process based on the settings specified in your YAML config file. Make sure the path is correct and all necessary parameters are properly configured.
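For multi-GPU fine-tuning, LLaMA-Factory can launch the same config through `torchrun`; a sketch is given below (the GPU ids are placeholders, and the LLaMA-Factory documentation is the authoritative reference for launcher options):

```bash
# Sketch: multi-GPU launch via LLaMA-Factory's torchrun wrapper.
# Adjust CUDA_VISIBLE_DEVICES to the GPUs available on your machine.
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 \
    llamafactory-cli train train/Qwen2_5-VL/QA_train_sub_fin_nu/3B_full_QA_train_bs8.yaml
```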
Training and testing data for nuScenes can be found in [nuscenes_train.json](nuscenes_train.json) and [nuscenes_test.json](nuscenes_test.json) respectively.
### 🧠 Inference
To run inference with a fine-tuned model, you need to use the following command:
```bash
python train/inference_scripts/sglang_infer.py \
    --model_name_or_path <model_name_or_path> \
    --dataset <dataset> \
    --save_name <save_name> \
    --template qwen2_vl \
    --tensor_parallel_size 1 \
    --data_parallel_size 1
```
Replace the placeholders with your actual paths:
* `<model_name_or_path>`: Name or path of the original pretrained model (e.g., Qwen2-VL-3B-Instruct)
* `<dataset>`: Dataset name registered in `dataset_info.json`, following [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
* `<save_name>`: Path where the inference results will be saved
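As a concrete illustration, a filled-in invocation might look like the following (the model path, dataset name, and output path are hypothetical; substitute your own):

```bash
# All paths and names below are hypothetical examples.
python train/inference_scripts/sglang_infer.py \
    --model_name_or_path saves/qwen2_5-vl-3b-full-qa \
    --dataset nuscenes_test \
    --save_name results/nuscenes_test_pred.json \
    --template qwen2_vl \
    --tensor_parallel_size 1 \
    --data_parallel_size 1
```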
### 🎯 Prompts
The prompts we use can be found in [prompts](prompts.md).
### 📊 Closed-loop Evaluation with NeuroNCAP
To understand the system's performance in a closed-loop simulation environment, see the details of our NeuroNCAP-based evaluation: [Closed-loop Evaluation](neuroncap_evaluation/evaluation.md) 🎮
### 🎬 Video Gallery
The videos compare the driving behavior of the two models in three representative challenging scenarios: stationary, frontal, and side. For each scenario, **the left column shows the behavior of the base model, which is fine-tuned on nuScenes. The right column shows the performance of the model trained on a subset of our proposed dataset and then fine-tuned on nuScenes.** Compared to the base model, the model trained with our data avoids vehicles more effectively, for example by steering around them or slowing down.
#### Stationary
Left: Base+nuScenes · Right: Base+Impromptu+nuScenes
#### Side
Left: Base+nuScenes · Right: Base+Impromptu+nuScenes
#### Frontal
Left: Base+nuScenes · Right: Base+Impromptu+nuScenes
