# TEMU-VTOFF
**Repository Path**: wowai/TEMU-VTOFF
## Basic Information
- **Project Name**: TEMU-VTOFF
- **Description**: Text-Enhanced MUlti-category Virtual Try-Off (mirror of https://github.com/davidelobba/TEMU-VTOFF.git)
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-14
- **Last Updated**: 2025-07-14
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**TEMU-VTOFF: Text-Enhanced MUlti-category Virtual Try-Off**
> **Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals**
> [Davide Lobba](https://scholar.google.com/citations?user=WEMoLPEAAAAJ&hl=en&oi=ao)<sup>1,2,*</sup>, [Fulvio Sanguigni](https://scholar.google.com/citations?user=tSpzMUEAAAAJ&hl=en)<sup>2,3,*</sup>, [Bin Ren](https://scholar.google.com/citations?user=Md9maLYAAAAJ&hl=en)<sup>1,2</sup>, [Marcella Cornia](https://scholar.google.com/citations?user=DzgmSJEAAAAJ&hl=en)<sup>3</sup>, [Rita Cucchiara](https://scholar.google.com/citations?user=OM3sZEoAAAAJ&hl=en)<sup>3</sup>, [Nicu Sebe](https://scholar.google.com/citations?user=stFCYOAAAAAJ&hl=en)<sup>1</sup>
> <sup>1</sup>University of Trento, <sup>2</sup>University of Pisa, <sup>3</sup>University of Modena and Reggio Emilia
> \* Equal contribution
## Table of Contents
- About The Project
- Key Features
- Getting Started
- Inference
- Dataset Inference
- Contact
- Citation
## 💡 About The Project
TEMU-VTOFF is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating clean, in-shop images of garments worn by a person. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.
## ✨ Key Features
Our contribution can be summarized as follows:
- **🎯 Multi-Category Try-Off**. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
- **🔗 Multimodal Hybrid Attention**. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately.
- **⚡ Garment Aligner Module**. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This improves alignment consistency across the dataset and preserves visual details more faithfully.
- **📊 Extensive experiments**. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both the quality of generated images and alignment with the target garment, highlighting its strong generalization capabilities.
## 💻 Getting Started
### Prerequisites
Clone the repository and enter it:
```sh
git clone https://github.com/davidelobba/TEMU-VTOFF.git
cd TEMU-VTOFF
```
### Installation
1. We recommend installing the required packages using Python's native virtual environment (venv) as follows:
```sh
python -m venv venv
source venv/bin/activate
```
2. Upgrade pip and install dependencies
```sh
pip install --upgrade pip
pip install -r requirements.txt
```
3. Create a `.env` file like the following:
```sh
export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
export HF_HOME="PATH WHERE YOU WANT TO SAVE THE HF MODELS"
```
🧠 Note: Access to Stable Diffusion 3 Medium must be requested via [Hugging Face](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers).
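As an alternative to `source .env`, the same variables can be loaded from Python. A minimal sketch, assuming the file uses `export KEY="VALUE"` lines as above (the `load_env` helper is illustrative and not part of this repository):

```python
import re
import shlex

def load_env(text):
    """Parse `export KEY="VALUE"` lines into a dict (illustrative helper)."""
    env = {}
    for line in text.splitlines():
        m = re.match(r'\s*export\s+([A-Za-z_]\w*)=(.*)$', line)
        if m:
            key, raw = m.groups()
            # shlex strips the surrounding quotes the way a shell would
            env[key] = shlex.split(raw)[0] if raw.strip() else ""
    return env

env_vars = load_env('export HF_HOME="/data/hf_models"')
```

In practice you would read the file and call `os.environ.update(load_env(open(".env").read()))` before importing libraries that consult `HF_TOKEN` or `HF_HOME`.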
## Inference
Generate an in-shop garment image from a single input photo:
```sh
source venv/bin/activate
source .env
python inference.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
--seed 42 \
--width "768" \
--height "1024" \
--output_dir "put here the output path" \
--mixed_precision "bf16" \
--example_image "examples/example1.jpg" \
--guidance_scale 2.0 \
--num_inference_steps 28
```
## Dataset Inference
### Dataset Captioning
Generate textual descriptions for each sample using a vision-language model (e.g., Qwen2.5-VL).
```sh
python precompute_utils/captioning_qwen.py \
--pretrained_model_name_or_path "Qwen/Qwen2.5-VL-7B-Instruct" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--filename "qwen_captions_2_5.json" \
--temperatures 0.2
```
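The dataset-inference step below reads `qwen_captions_2_5_0_2.json`, which suggests the captioning script appends the temperature (with `.` replaced by `_`) to the `--filename` stem. A sketch of that assumed naming convention (`caption_filename` is hypothetical, for illustration only):

```python
def caption_filename(base: str, temperature: float) -> str:
    """Assumed convention: append the temperature to the file stem,
    replacing '.' with '_' (e.g. 0.2 -> '_0_2')."""
    stem, ext = base.rsplit(".", 1)
    return f"{stem}_{str(temperature).replace('.', '_')}.{ext}"

name = caption_filename("qwen_captions_2_5.json", 0.2)
# -> "qwen_captions_2_5_0_2.json"
```

If the script's actual output name differs, pass that name to `--coarse_caption_file` instead.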
### Feature extraction
Extract textual features using OpenCLIP, CLIP and T5 text encoders.
```sh
phases=("test" "train")
for phase in "${phases[@]}"; do
  python precompute_utils/precompute_text_features.py \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
    --dataset_name "dresscode" \
    --dataset_root "put here your dataset path" \
    --phase "$phase" \
    --order "paired" \
    --category "all" \
    --output_dir "" \
    --seed 42 \
    --height 1024 \
    --width 768 \
    --batch_size 4 \
    --mixed_precision "fp16" \
    --num_workers 8 \
    --text_encoders "T5" "CLIP" \
    --captions_type "qwen_text_embeddings"
done
```
Extract visual features using OpenCLIP and CLIP vision encoders.
```sh
phases=("test" "train")
for phase in "${phases[@]}"; do
  python precompute_utils/precompute_image_features.py \
    --dataset "dresscode" \
    --dataroot "put here your dataset path" \
    --phase "$phase" \
    --order "paired" \
    --category "all" \
    --seed 42 \
    --height 1024 \
    --width 768 \
    --batch_size 4 \
    --mixed_precision "fp16" \
    --num_workers 8
done
```
### Generate Images
Generate the in-shop garment images for the Dress Code or VITON-HD dataset using the TEMU-VTOFF model.
```sh
source venv/bin/activate
source .env
python inference_dataset.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--output_dir "put here the output path" \
--coarse_caption_file "qwen_captions_2_5_0_2.json" \
--phase "test" \
--order "paired" \
--height "1024" \
--width "768" \
--mask_type bounding_box \
--category "all" \
--batch_size 4 \
--mixed_precision "bf16" \
--seed 42 \
--num_workers 8 \
--fine_mask \
--guidance_scale 2.0 \
--num_inference_steps 28
```
## 📬 Contact
**Lead Authors:**
- 📧 **Davide Lobba**: [davide.lobba@unitn.it](mailto:davide.lobba@unitn.it) | 🎓 [Google Scholar](https://scholar.google.com/citations?user=WEMoLPEAAAAJ&hl=en&oi=ao)
- 📧 **Fulvio Sanguigni**: [fulvio.sanguigni@unimore.it](mailto:fulvio.sanguigni@unimore.it) | 🎓 [Google Scholar](https://scholar.google.com/citations?user=tSpzMUEAAAAJ&hl=en)
For questions about the project, feel free to reach out to any of the lead authors!
## Citation
Please cite our paper if you find our work helpful:
```bibtex
@article{lobba2025inverse,
  title={Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals},
  author={Lobba, Davide and Sanguigni, Fulvio and Ren, Bin and Cornia, Marcella and Cucchiara, Rita and Sebe, Nicu},
  journal={arXiv preprint arXiv:2505.21062},
  year={2025}
}
```