# LightEMMA: Lightweight End-to-end Multimodal Autonomous Driving

[![arXiv](https://img.shields.io/badge/arXiv-2505.00284-red.svg)](https://arxiv.org/abs/2505.00284)
[![nuScenes](https://img.shields.io/badge/dataset-nuScenes-green.svg)](https://www.nuscenes.org/nuscenes)
[![Hugging Face](https://img.shields.io/badge/🤗-Huggingface-yellow.svg)](https://huggingface.co/)

**LightEMMA** is an open-loop, end-to-end autonomous driving framework designed to leverage the zero-shot capabilities of vision-language models (VLMs). Its primary task is predicting driving actions and trajectories, evaluated on real-world driving scenarios from the nuScenes dataset. The framework integrates multiple state-of-the-art VLMs, including **GPT**, **Claude**, **Gemini**, **Qwen**, **DeepSeek**, and **LLaMA**.

![Architecture overview](images/architecture.png)

## Table of Contents

- [Overview](#overview)
- [Features](#features)
  - [Supported Models](#supported-models)
- [Environment Setup](#environment-setup)
  - [Recommended Settings](#recommended-settings)
  - [Installation](#installation)
- [Code Structure](#code-structure)
  - [Dataset Preparation](#dataset-preparation)
  - [Configuration](#configuration)
- [Usage](#usage)
  - [Run Predictions](#run-predictions)
  - [Run Baseline](#run-baseline-optional)
  - [Evaluation](#evaluation)
  - [Comparing Models](#comparing-models)
  - [Pre-generated Outputs](#pre-generated-outputs)
- [License](#license)
- [Citation](#citation)

## Overview

LightEMMA processes front-camera images from the nuScenes dataset and prompts VLMs with a chain-of-thought (CoT) reasoning approach:

1. **Scene Description**: Generate a detailed description of the driving environment, including lane markings, traffic lights, vehicles, pedestrian activity, and other pertinent objects.
2. **Driving Intent Analysis**: Infer the driving intent from the current scene and the ego vehicle's historical actions to predict the next high-level driving maneuver.
3. **Trajectory Prediction**: Convert the high-level driving intent into low-level driving actions, specified as precise speed and curvature values.

This end-to-end approach allows LightEMMA to leverage the rich semantic understanding of VLMs for zero-shot autonomous driving.

![Example scenario scene-0079](images/scene-0079.gif)

## Features

1. **Model Extensibility**: Supports 12 state-of-the-art VLMs, with an extensible framework that makes it easy to integrate additional models.
2. **Computational Metrics**: Extensive computational analysis, including inference time, cost, token usage, and hardware specifications.
3. **Benchmark Compatibility**: Standard L2 error analysis compatible with the nuScenes benchmark, enabling direct comparison and visualization (see the sketch after this list).
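The sketch below illustrates one way such an L2 comparison can be computed from predicted and ground-truth waypoints. It is a minimal example, not the repository's `evaluate.py`: it assumes 2 Hz nuScenes keyframes and reports the error averaged up to each horizon, a convention that varies between papers and may differ from the one used here.

```python
import numpy as np

def l2_errors(pred, gt, hz=2):
    """Mean L2 error (metres) between predicted and ground-truth waypoints
    at 1 s, 2 s, and 3 s horizons, assuming ~2 Hz nuScenes keyframes."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)  # (N, 2) x/y positions
    dists = np.linalg.norm(pred - gt, axis=-1)                             # per-waypoint L2 distance
    return {f"L2@{t}s": float(dists[: t * hz].mean()) for t in (1, 2, 3)}

# Toy example: a prediction that drifts 0.1 m further off at each future frame.
gt = [[0.0, i] for i in range(1, 7)]
pred = [[0.1 * i, i] for i in range(1, 7)]
print(l2_errors(pred, gt))  # ≈ {'L2@1s': 0.15, 'L2@2s': 0.25, 'L2@3s': 0.35}
```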
### Supported Models

LightEMMA supports the following VLMs (we recommend against DeepSeek-VL due to its poor performance and incompatible environment requirements):

**Commercial API-based models**:

- [`gpt-4o`](https://platform.openai.com/docs/models/gpt-4o): OpenAI GPT-4o
- [`gpt-4.1`](https://platform.openai.com/docs/models/gpt-4.1): OpenAI GPT-4.1
- [`claude-3.7`](https://docs.anthropic.com/claude/docs/models-overview): Anthropic Claude-3.7-Sonnet
- [`claude-3.5`](https://docs.anthropic.com/claude/docs/models-overview): Anthropic Claude-3.5-Sonnet
- [`gemini-2.5`](https://ai.google.dev/models/gemini): Google Gemini-2.5-Pro
- [`gemini-2.0`](https://ai.google.dev/models/gemini): Google Gemini-2.0-Flash

**Open-source local models**:

- [`qwen2.5-7b`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): Qwen2.5-VL-7B-Instruct
- [`qwen2.5-72b`](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct): Qwen2.5-VL-72B-Instruct
- [`llama-3.2-11b`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct): Llama-3.2-11B-Vision-Instruct
- [`llama-3.2-90b`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct): Llama-3.2-90B-Vision-Instruct
- [`deepseek-vl2-16b`](https://huggingface.co/deepseek-ai/deepseek-vl2-small): deepseek-vl2-small
- [`deepseek-vl2-28b`](https://huggingface.co/deepseek-ai/deepseek-vl2): deepseek-vl2

## Environment Setup

### Recommended Settings

- **System:** Ubuntu 22.04
- **CUDA:** 12.4
- **Python:** 3.10
- **Commercial models:** Active API keys required.
- **Open-source models:** High-performance, high-memory CUDA-compatible GPUs (e.g., NVIDIA L40, H100). Multi-GPU setups are advised for the larger models and for accelerating inference.

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/michigan-traffic-lab/LightEMMA.git
   ```

2. Create and activate a conda environment:

   ```bash
   cd LightEMMA
   conda create -n lightemma python=3.10
   conda activate lightemma
   ```

3. Install LightEMMA and its dependencies:

   ```bash
   pip install -e .
   ```

## Code Structure

- `predict.py`: Main script for generating predictions
- `evaluate.py`: Evaluates prediction results for a single model
- `baseline.py`: Evaluates constant-velocity baseline predictions
- `evaluate_all.py`: Processes results from multiple models for comparison
- `vlm.py`: Utility functions for running vision-language models
- `utils.py`: Utility functions for trajectory calculations and visualization
- `setup.py`: Package installation and dependency configuration

### Dataset Preparation

Register an account on the [nuScenes website](https://www.nuscenes.org/nuscenes), scroll down to the **Full dataset (v1.0)**, download it, and extract it to your desired location. We recommend starting with `v1.0-mini` for initial experiments, then proceeding to a full evaluation with `v1.0-test`. All experiments in this work use the **US** version of the dataset.
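Although not part of LightEMMA itself, a quick way to confirm that the dataset extracted correctly is to walk one scene's front-camera keyframes with the official `nuscenes-devkit`. The snippet below is a minimal sketch; the `dataroot` path is a placeholder you should adjust to match your `config.yaml`.

```python
from nuscenes.nuscenes import NuScenes  # pip install nuscenes-devkit

# Adjust version/dataroot to match the `data` section of config.yaml.
nusc = NuScenes(version="v1.0-mini", dataroot="/path/to/nuscenes", verbose=True)

scene = nusc.scene[0]                        # first scene in the split
sample_token = scene["first_sample_token"]
while sample_token:                          # walk the ~2 Hz keyframes in order
    sample = nusc.get("sample", sample_token)
    cam_token = sample["data"]["CAM_FRONT"]  # sample_data token of the front camera
    img_path, _, _ = nusc.get_sample_data(cam_token)
    print(img_path)                          # the image a VLM prompt would be built from
    sample_token = sample["next"]            # empty string after the last keyframe
```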
### Configuration

Configure your API keys and dataset paths in `config.yaml`:

```yaml
# API keys
api_keys:
  openai: "your-openai-api-key"            # GPT models
  anthropic: "your-anthropic-api-key"      # Claude models
  gemini: "your-gemini-api-key"            # Gemini models
  huggingface: "your-huggingface-api-key"  # To download models from Hugging Face

model_path:
  huggingface: "path/to/huggingface/models"  # Directory to store downloaded models

# Dataset
data:
  root: "/path/to/nuscenes"  # Path to the nuScenes dataset
  version: "v1.0-mini"       # Dataset version (v1.0-mini, v1.0-trainval, v1.0-test)
  results: "results"         # Directory to store results

# Prediction parameters (do not change)
prediction:
  obs_len: 6  # Number of past observations (frames)
  fut_len: 6  # Number of future predictions (frames)
  ext_len: 2  # Extra length for calculations (frames)
```

A minimal sketch for loading and sanity-checking this file is included at the end of this README.

## Usage

### Run Predictions

The examples below use GPT-4o, but you can substitute any model listed in [Supported Models](#supported-models).

To run prediction on all scenes:

```bash
python predict.py --model gpt-4o --all_scenes
```

To run a specific scene:

```bash
python predict.py --model gpt-4o --scene scene-0103
```

To continue from a previous run:

```bash
python predict.py --model gpt-4o --continue_dir results/gpt-4o_20250415-123
```

### Run Baseline (Optional)

To establish a constant-velocity baseline:

```bash
python baseline.py
```

### Evaluation

To evaluate the performance of a single model:

```bash
python evaluate.py --results_dir results/gpt-4o_20250415-123
```

If visualization is not needed (faster):

```bash
python evaluate.py --results_dir results/gpt-4o_20250415-123 --no_vis
```

### Comparing Models

To process evaluation results from multiple models and generate a comparative analysis:

```bash
python evaluate_all.py
```

### Pre-generated Outputs

All results reported in the paper are provided in the `results` folder as JSON files, one file per scenario. Because the images on your system are stored in a different location than the paths recorded in these JSON files, pass `--local_samples_path` as shown below to generate visualizations correctly. If you do not need visualizations, add the `--no_vis` flag instead.

```bash
python evaluate.py --results_dir results/gpt-4o_20250415-123 --local_samples_path /path/to/nuscenes/samples
```

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Citation

If you use LightEMMA in your research, please consider citing:

```bibtex
@article{lightemma,
  title={LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving},
  author={Zhijie Qiao and Haowei Li and Zhong Cao and Henry X. Liu},
  year={2025},
  eprint={2505.00284},
  url={https://arxiv.org/abs/2505.00284},
}
```
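Finally, the sketch referenced in the [Configuration](#configuration) section: an illustrative way to load and sanity-check `config.yaml` with PyYAML before launching a long prediction run. The key names follow the configuration block shown there; the checks themselves are only an example and are not how LightEMMA itself reads the file.

```python
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Key names taken from the configuration block above; adjust if your file differs.
assert cfg["data"]["version"] in {"v1.0-mini", "v1.0-trainval", "v1.0-test"}
assert cfg["prediction"]["obs_len"] == 6 and cfg["prediction"]["fut_len"] == 6
print("nuScenes root :", cfg["data"]["root"])
print("results folder:", cfg["data"]["results"])
```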