# Sa2VA

**Repository Path**: ByteDance/Sa2VA

## Basic Information

- **Project Name**: Sa2VA
- **Description**: 🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-25
- **Last Updated**: 2026-06-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Pixel LLMs: Pixel-Level Grounded Understanding for Multimodal LLMs

**Pixel LLMs** is a family of projects that bring pixel-level, dense grounded understanding to multimodal LLMs. It is anchored by **Sa2VA**, a unified model that marries SAM-2 with LLaVA for dense grounded understanding of images and videos, together with a growing set of research projects built on top of it.

![Teaser](assets/images/teaser.jpg)

## Projects

### 🧠 [Sa2VA](./projects/sa2va/README.md): Marrying SAM2 with LLaVA

*Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang*

The core unified model: SAM-2 + MLLM for referring segmentation, grounded conversation, visual prompting, and image/video chat. Supports InternVL2.5/3 and Qwen2.5-VL/Qwen3-VL backbones.

📂 [`projects/sa2va`](./projects/sa2va/README.md) · [📜 arXiv](https://arxiv.org/abs/2501.04001) · [🏠 Page](https://lxtgh.github.io/project/sa2va) · [🤗 Models](https://huggingface.co/collections/ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093)

### 🔍 [VRT](./projects/vrt_sa2va/README.md): Visual Reasoning Tracer

*Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang*

Object-level grounded reasoning built on Sa2VA. Ships **VRT-Bench** (evaluation) and **VRT-80k** (training data).

📂 [`projects/vrt_sa2va`](./projects/vrt_sa2va/README.md) · [📜 arXiv](https://arxiv.org/pdf/2512.05091) · [🏠 Page](https://harboryuan.github.io/visual-reasoning-tracer/) · [🤗 Data](https://huggingface.co/datasets/HarborYuan/VRT-Eval)

### 🧩 [SAMTok](./projects/samtok/README.md): Representing Any Mask with Two Words (CVPR 2026)

*Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li*

A unified mask-token interface that lets any MLLM generate and understand masks.

📂 [`projects/samtok`](./projects/samtok/README.md) · [📜 arXiv](https://arxiv.org/abs/2601.16093) · [🏠 Page](https://zhouyiks.github.io/projects/SAMTok/) · [🤗 Models](https://huggingface.co/collections/zhouyik/samtok)

### Extensions

- **[SaSaSa2VA](./projects/sasasa2va/README.md)**: a segmentation-augmented extension of Sa2VA that won **1st place** in the ICCV 2025 LSVOS Challenge RVOS Track 🏅.
- **[Pixel-SAIL](./projects/pixel_sail/README.md)**: single-transformer pixel-level grounding.

## Environment

We manage dependencies with [`uv`](https://docs.astral.sh/uv/). Install it once:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

The environment is defined under [`projects/sa2va`](./projects/sa2va) (`pyproject.toml` + `uv.lock`) and shared across the projects. The quickest way to set it up is the helper script at the repo root, which places the virtualenv in `/tmp` and symlinks it back into the project:

```bash
bash setup_env.sh                  # projects/sa2va, --extra=latest
# bash setup_env.sh sa2va legacy   # InternVL2.5 or earlier
```

Or do it manually:

```bash
cd projects/sa2va
uv sync --extra=latest   # or --extra=legacy
```

Then run training / evaluation from the repository root with the environment activated (`source projects/sa2va/.venv/bin/activate`). See each project's README for project-specific steps.

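Because the helper script symlinks the virtualenv from `/tmp`, it can be handy to confirm that the active interpreter really belongs to `projects/sa2va/.venv` before launching a long job. The following is a minimal sketch, not part of the repository: the function names `active_venv` and `is_expected_venv` are hypothetical, and it assumes you run it from the repository root.

```python
import sys
from pathlib import Path
from typing import Optional


def active_venv(prefix: str = sys.prefix,
                base_prefix: str = sys.base_prefix) -> Optional[Path]:
    """Return the active virtualenv root, or None when no virtualenv is active."""
    # Inside a virtualenv, sys.prefix points at the venv directory and
    # differs from sys.base_prefix (the base Python installation).
    return Path(prefix) if prefix != base_prefix else None


def is_expected_venv(venv: Optional[Path], expected: Path) -> bool:
    """True when both paths resolve to the same directory (symlinks are followed)."""
    if venv is None:
        return False
    return venv.resolve() == expected.resolve()


if __name__ == "__main__":
    # Path taken from the activation command above; run from the repository root.
    expected = Path("projects/sa2va/.venv")
    venv = active_venv()
    if is_expected_venv(venv, expected):
        print(f"OK: using {venv}")
    else:
        print(f"Warning: expected {expected}, but the active environment is {venv}")
```

Both paths are resolved before comparison, so the check should still pass when `.venv` is a symlink to a directory under `/tmp`.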
For tokens / API keys (HuggingFace, OpenRouter), copy the template and fill it in; `setup_env.sh` loads it automatically:

```bash
cp .env.example .env
# then edit .env
```

Why `uv`? It treats the environment as code: dependencies are declared in `pyproject.toml` and every transitive package is version-locked in `uv.lock`. The result is a single source of truth that is fully reproducible across machines, trivial to maintain, and recreated exactly with one `uv sync`, with no manual `pip install` drift.

## Citation

If you find this repository useful, please consider citing the relevant papers:

```bibtex
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Sun, Yueyi and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv pre-print},
  year={2025}
}

@article{yuan2025vrt,
  title={Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark},
  author={Yuan, Haobo and Sun, Yueyi and Li, Yanwei and Zhang, Tao and Deng, Xueqing and Ding, Henghui and Qi, Lu and Wang, Anran and Li, Xiangtai and Yang, Ming-Hsuan},
  journal={arXiv pre-print},
  year={2025}
}

@inproceedings{zhou2026samtok,
  title={SAMTok: Representing Any Mask with Two Words},
  author={Zhou, Yikang and Zhang, Tao and Gong, Dengxian and Wu, Yuanzheng and Tian, Ye and Wang, Haochen and Yuan, Haobo and Wang, Jiacong and Qi, Lu and Fei, Hao and Wang, Anran and Wang, Zhuochen and Wang, Yujing and Chen, Cheng and Ji, Shunping and Li, Xiangtai},
  booktitle={CVPR},
  address={Denver, CO, USA},
  year={2026}
}
```