OOTDiffusion: Outfitting Fusion Based Latent Diffusion for Controllable Virtual Try-On
Authors: Yuhao Xu, Tao Gu, Weifeng Chen, Arlene Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments on the VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently generates high-quality try-on results for arbitrary human and garment images, which outperforms other VTON methods in both realism and controllability, indicating a breakthrough in virtual try-on. ... We train our OOTDiffusion on two broadly-used high-resolution benchmark datasets, i.e., VITON-HD (Choi et al. 2021) and Dress Code (Morelli et al. 2022), respectively. Extensive qualitative and quantitative evaluations demonstrate our superiority over the state-of-the-art VTON methods in both realism and controllability for various target human and garment images (see Figure 1), implying an impressive breakthrough in image-based virtual try-on. |
| Researcher Affiliation | Industry | Yuhao Xu, Tao Gu, Weifeng Chen, Arlene Chen Xiao-i Research EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using text and diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/levihsu/OOTDiffusion |
| Open Datasets | Yes | Our experiments are performed on two high-resolution (1024 x 768) virtual try-on datasets, i.e., VITON-HD (Choi et al. 2021) and Dress Code (Morelli et al. 2022). |
| Dataset Splits | Yes | The VITON-HD dataset consists of 13,679 image pairs of frontal half-body models and corresponding upper-body garments, where 2,032 pairs are used as the test set. The Dress Code dataset consists of 15,363/8,951/29,478 image pairs of full-body models and corresponding upper-body garments/lower-body garments/dresses, where 1,800 pairs for each garment category are used as the test set. |
| Hardware Specification | Yes | All the models are trained for 36,000 iterations on a single NVIDIA A100 GPU, with a batch size of 64 for the 512 x 384 resolution and 16 for the 1024 x 768 resolution. At inference time, we run our OOTDiffusion on a single NVIDIA RTX 4090 GPU for 20 sampling steps using the UniPC sampler (Zhao et al. 2024). |
| Software Dependencies | No | The paper mentions "Stable Diffusion v1.5 (Rombach et al. 2022)" as inherited pretrained weights, and specific optimizers and samplers (AdamW optimizer, UniPC sampler), but does not provide specific version numbers for underlying software libraries like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | In our experiments, we initialize the OOTDiffusion models by inheriting the pretrained weights of Stable Diffusion v1.5 (Rombach et al. 2022). Then we finetune the outfitting and denoising UNets using an AdamW optimizer (Loshchilov and Hutter 2018) with a fixed learning rate of 5e-5. ... All the models are trained for 36,000 iterations on a single NVIDIA A100 GPU, with a batch size of 64 for the 512 x 384 resolution and 16 for the 1024 x 768 resolution. ... And the optimal value of the guidance scale sg is usually around 1.5 - 2.0 according to our ablation study. ... we empirically set sg = 1.5 for the VITON-HD dataset (Choi et al. 2021) and sg = 2.0 for the Dress Code dataset (Morelli et al. 2022) in the following experiments. |
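The guidance scale sg quoted in the Experiment Setup row controls how strongly the garment conditioning steers the denoising process. The table does not quote the paper's exact equation, so the sketch below shows only the standard classifier-free-guidance combination of an unconditional and a conditioned noise prediction; the function name and array shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def guided_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray, s_g: float) -> np.ndarray:
    """Standard classifier-free-guidance mix of noise predictions.

    s_g = 1.0 reproduces the conditioned prediction; s_g > 1.0
    extrapolates past it, strengthening the conditioning signal
    (the paper reports sg around 1.5 - 2.0 as the useful range).
    """
    return eps_uncond + s_g * (eps_cond - eps_uncond)

# Toy tensors standing in for UNet outputs (illustrative only).
eps_u = np.zeros((2, 2))
eps_c = np.ones((2, 2))
out = guided_noise(eps_u, eps_c, 1.5)  # each entry is 1.5: pushed past eps_c
```

Under this reading, sg = 1.5 (VITON-HD) and sg = 2.0 (Dress Code) trade off realism against how faithfully the generated image follows the garment condition.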