# 1008-T-VSL **Repository Path**: duchenyi/1008-t-vsl ## Basic Information - **Project Name**: 1008-T-VSL - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-10-08 - **Last Updated**: 2026-03-22 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # T-VSL: Text-Guided Visual Sound Source Localization in Mixtures (CVPR 2024) [Paper](https://openaccess.thecvf.com/content/CVPR2024/papers/Mahmud_T-VSL_Text-Guided_Visual_Sound_Source_Localization_in_Mixtures_CVPR_2024_paper.pdf) | [Supplementary](https://openaccess.thecvf.com/content/CVPR2024/supplemental/Mahmud_T-VSL_Text-Guided_Visual_CVPR_2024_supplemental.pdf) | [Arxiv](https://arxiv.org/abs/2404.01751) | [Video](https://www.youtube.com/watch?v=oKc8RwDHjsA) | [Poster](https://drive.google.com/file/d/1-RcoTY8aR8b9JbloNnGGCAAiNWCdOPyP/view?usp=drive_link)
by [Tanvir Mahmud](https://sites.google.com/view/tanvirmahmud), [Yapeng Tian](https://www.yapengtian.com/), [Diana Marculescu](https://www.ece.utexas.edu/people/faculty/diana-marculescu)
T-VSL incorporates the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures.
MA-AVT Illustration
## Environment To setup the environment, please simply run ``` pip install -r requirements.txt ``` ## Datasets ### MUSIC Data can be downloaded from [Sound of Pixels](https://github.com/roudimit/MUSIC_dataset) ### VGG-Instruments Data can be downloaded from [Mix and Localize: Localizing Sound Sources in Mixtures](https://github.com/hxixixh/mix-and-localize) ### VGG-Sound Source Data can be downloaded from [Localizing Visual Sounds the Hard Way](https://github.com/hche11/Localizing-Visual-Sounds-the-Hard-Way) ## Train For training the T-VSL model, please run ``` python main.py --train_data_path ./data/vggsound \ --mode train --test_data_path ./data/vggsound \ --test_gt_path ./metadata/vggsound_duet_test.csv \ --output_dir ./path/to/output/dir \ --id vggsound_duet --model tvsl \ --trainset vggsound_duet --num_class 221 \ --testset vggsound_duet --epochs 100 \ --batch_size 256 --init_lr 0.01 \ --lr_schedule cos --multiprocessing_distributed \ --ngpu 4 --port 11342 --ciou_thr 0.3 \ --iou_thr 0.3 --save_visualizations \ --audioclip_ckpt_path ./path/to/audioclip/pretrained/ckpt ``` ## Test For testing and visualization, simply run ``` python main.py --mode test \ --train_data_path ./data/vggsound \ --test_data_path ./data/vggsound \ --test_gt_path ./metadata/vggsound_duet_test.csv \ --output_dir ./path/to/output/dir \ --id vggsound_duet --model tvsl \ --trainset vggsound_duet --num_class 221 \ --testset vggsound_duet --epochs 100 \ --batch_size 256 --init_lr 0.01 \ --lr_schedule cos --multiprocessing_distributed \ --ngpu 4 --port 11342 --ciou_thr 0.3 \ --iou_thr 0.3 --save_visualizations \ --load /path/to/pretrained/ckpt \ --audioclip_ckpt_path ./path/to/audioclip/pretrained/ckpt ``` ## 👍 Acknowledgments This codebase is based on [AVGN](https://github.com/stoneMo/AVGN) and [AudioCLIP](https://github.com/AndreyGuzhov/AudioCLIP). Thanks for their amazing works. ## LICENSE T-VSL is licensed under a [UT Austin Research LICENSE](./LICENSE). ## Citation If you find this work useful, please consider citing our paper: ## BibTeX ```bibtex @inproceedings{mahmud2024t, title={T-vsl: Text-guided visual sound source localization in mixtures}, author={Mahmud, Tanvir and Tian, Yapeng and Marculescu, Diana}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages={26742--26751}, year={2024} } ``` ## Contributors