# TEMU-VTOFF

**Repository Path**: wowai/TEMU-VTOFF

## Basic Information

- **Project Name**: TEMU-VTOFF
- **Description**: https://github.com/davidelobba/TEMU-VTOFF.git
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-14
- **Last Updated**: 2025-07-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

TEMU-VTOFF

Text-Enhanced MUlti-category Virtual Try-Off

> **Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals**
>
> [Davide Lobba](https://scholar.google.com/citations?user=WEMoLPEAAAAJ&hl=en&oi=ao)1,2,\*, [Fulvio Sanguigni](https://scholar.google.com/citations?user=tSpzMUEAAAAJ&hl=en)2,3,\*, [Bin Ren](https://scholar.google.com/citations?user=Md9maLYAAAAJ&hl=en)1,2, [Marcella Cornia](https://scholar.google.com/citations?user=DzgmSJEAAAAJ&hl=en)3, [Rita Cucchiara](https://scholar.google.com/citations?user=OM3sZEoAAAAJ&hl=en)3, [Nicu Sebe](https://scholar.google.com/citations?user=stFCYOAAAAAJ&hl=en)1
>
> 1University of Trento, 2University of Pisa, 3University of Modena and Reggio Emilia
>
> \* Equal contribution
Table of Contents
  1. About The Project
  2. Key Features
  3. Getting Started
  4. Inference
  5. Dataset Inference
  6. Contact
  7. Citation
## 💡 About The Project

TEMU-VTOFF is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating clean, in-shop images of garments worn by a person. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.

## ✨ Key Features

Our contributions can be summarized as follows:

- **🎯 Multi-Category Try-Off**. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
- **🔗 Multimodal Hybrid Attention**. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately.
- **⚡ Garment Aligner Module**. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This improves alignment consistency across the dataset and better preserves fine visual details.
- **📊 Extensive experiments**. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both the quality of generated images and alignment with the target garment, highlighting its strong generalization capabilities.

## 💻 Getting Started

### Prerequisites

Clone the repository:

```sh
git clone https://github.com/davidelobba/TEMU-VTOFF.git
```

### Installation

1. We recommend installing the required packages in a Python virtual environment (venv):
   ```sh
   python -m venv venv
   source venv/bin/activate
   ```
2. Upgrade pip and install the dependencies:
   ```sh
   pip install --upgrade pip
   pip install -r requirements.txt
   ```
3. Create a `.env` file like the following:
   ```sh
   export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
   export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
   export HF_HOME="PATH WHERE YOU WANT TO SAVE THE HF MODELS"
   ```

🧠 Note: Access to Stable Diffusion 3 Medium must be requested via [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers).

## Inference

Generate the in-shop garment image for a single example:

```sh
# Activate the environment and load the tokens defined in .env
source venv/bin/activate
source .env

python inference.py \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
    --pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
    --seed 42 \
    --width "768" \
    --height "1024" \
    --output_dir "put here the output path" \
    --mixed_precision "bf16" \
    --example_image "examples/example1.jpg" \
    --guidance_scale 2.0 \
    --num_inference_steps 28
```

## Dataset Inference

### Dataset Captioning

Generate textual descriptions for each sample using a vision-language model (e.g., Qwen2.5-VL).

```sh
python precompute_utils/captioning_qwen.py \
    --pretrained_model_name_or_path "Qwen/Qwen2.5-VL-7B-Instruct" \
    --dataset_name "dresscode" \
    --dataset_root "put here your dataset path" \
    --filename "qwen_captions_2_5.json" \
    --temperatures 0.2
```

### Feature extraction

Extract textual features using OpenCLIP, CLIP and T5 text encoders.

```sh
# Run the extraction once per dataset split
phases=("test" "train")
for phase in "${phases[@]}"; do
    python precompute_utils/precompute_text_features.py \
        --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
        --dataset_name "dresscode" \
        --dataset_root "put here your dataset path" \
        --phase "$phase" \
        --order "paired" \
        --category "all" \
        --output_dir "" \
        --seed 42 \
        --height 1024 \
        --width 768 \
        --batch_size 4 \
        --mixed_precision "fp16" \
        --num_workers 8 \
        --text_encoders "T5" "CLIP" \
        --captions_type "qwen_text_embeddings"
done
```

Extract visual features using OpenCLIP and CLIP vision encoders.
```sh
# Run the extraction once per dataset split
phases=("test" "train")
for phase in "${phases[@]}"; do
    python precompute_utils/precompute_image_features.py \
        --dataset "dresscode" \
        --dataroot "put here your dataset path" \
        --phase "$phase" \
        --order "paired" \
        --category "all" \
        --seed 42 \
        --height 1024 \
        --width 768 \
        --batch_size 4 \
        --mixed_precision "fp16" \
        --num_workers 8
done
```

### Generate Images

Generate the in-shop garment images for Dress Code or VITON-HD using the TEMU-VTOFF model.

```sh
# Activate the environment and load the tokens defined in .env
source venv/bin/activate
source .env

python inference_dataset.py \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
    --pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
    --dataset_name "dresscode" \
    --dataset_root "put here your dataset path" \
    --output_dir "put here the output path" \
    --coarse_caption_file "qwen_captions_2_5_0_2.json" \
    --phase "test" \
    --order "paired" \
    --height "1024" \
    --width "768" \
    --mask_type bounding_box \
    --category "all" \
    --batch_size 4 \
    --mixed_precision "bf16" \
    --seed 42 \
    --num_workers 8 \
    --fine_mask \
    --guidance_scale 2.0 \
    --num_inference_steps 28
```

## 📬 Contact

**Lead Authors:**

- 📧 **Davide Lobba**: [davide.lobba@unitn.it](mailto:davide.lobba@unitn.it) | 🎓 [Google Scholar](https://scholar.google.com/citations?user=WEMoLPEAAAAJ&hl=en&oi=ao)
- 📧 **Fulvio Sanguigni**: [fulvio.sanguigni@unimore.it](mailto:fulvio.sanguigni@unimore.it) | 🎓 [Google Scholar](https://scholar.google.com/citations?user=tSpzMUEAAAAJ&hl=en)

For questions about the project, feel free to reach out to any of the lead authors!

## Citation

Please cite our paper if you find our work helpful:

```bibtex
@article{lobba2025inverse,
  title={Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals},
  author={Lobba, Davide and Sanguigni, Fulvio and Ren, Bin and Cornia, Marcella and Cucchiara, Rita and Sebe, Nicu},
  journal={arXiv preprint arXiv:2505.21062},
  year={2025}
}
```
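As a companion to the Getting Started and Inference steps, the following is a minimal Python sketch of a pre-flight check and command builder. It is not part of the repository: the helper names `missing_env_vars` and `build_inference_command` are hypothetical, and treating all three `.env` variables as required is an assumption. The flags and default values are copied from the single-image inference command in this README.

```python
import shlex

# Variables defined in the .env file from the Installation section.
# Assumption: all three are treated as required; the scripts may need fewer.
REQUIRED_ENV_VARS = ("WANDB_API_KEY", "HF_TOKEN", "HF_HOME")


def missing_env_vars(env):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def build_inference_command(example_image, output_dir, seed=42):
    """Assemble the single-image inference command as an argument list.

    Hypothetical helper: flag names and values mirror the README command.
    """
    return [
        "python", "inference.py",
        "--pretrained_model_name_or_path", "stabilityai/stable-diffusion-3-medium-diffusers",
        "--pretrained_model_name_or_path_sd3_tryoff", "davidelobba/TEMU-VTOFF",
        "--seed", str(seed),
        "--width", "768",
        "--height", "1024",
        "--output_dir", output_dir,
        "--mixed_precision", "bf16",
        "--example_image", example_image,
        "--guidance_scale", "2.0",
        "--num_inference_steps", "28",
    ]


# Demo with a partial environment: WANDB_API_KEY is absent on purpose.
demo_env = {"HF_TOKEN": "hf_xxx", "HF_HOME": "/tmp/hf"}
print(missing_env_vars(demo_env))  # ['WANDB_API_KEY']

# Print the command instead of running it; pass the list to subprocess.run to execute.
print(shlex.join(build_inference_command("examples/example1.jpg", "outputs")))
```

In practice one would call `missing_env_vars(os.environ)` after `source .env` and stop early if the list is non-empty, which avoids a failed model download partway through a run.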