SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

Authors: Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. ... Extensive experimental results demonstrate that our proposed SwiftTry framework, leveraging these techniques, significantly outperforms existing video virtual try-on methods in both accuracy and efficiency. ... We evaluate our approach on the VVT dataset (Dong et al. 2019b) and our new TikTok Dress dataset. ... We conducted ablation studies on the VVT dataset to investigate various factors affecting the performance of SwiftTry.
Researcher Affiliation | Industry | VinAI Research, Vietnam
Pseudocode | No | The paper describes the methodology and techniques, such as the 'Overall Architecture' and 'Shift Caching Technique', using descriptive text and diagrams (Figure 2, Figure 3, Figure 4), but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions comparing with other methods, noting that 'As most methods are closed-source, we rely on reported results and available generated videos for comparison.' It also states, 'We re-evaluate ViViD (Fang et al. 2024) on the VVT dataset, as it is the only method with available inference code and pre-trained weights.' However, there is no explicit statement or link provided for the open-sourcing of the SwiftTry method described in this paper.
Open Datasets | Yes | We evaluate our approach on the VVT dataset (Dong et al. 2019b) and our new TikTok Dress dataset. ... The VVT dataset, a standard benchmark for video virtual try-on, includes 791 paired videos of individuals and clothing images, with 661 for training and 130 for testing, all at 256x192 resolution. ... Additionally, we have introduced a new dataset, TikTok Dress, designed specifically for video virtual try-on.
Dataset Splits | Yes | The VVT dataset, a standard benchmark for video virtual try-on, includes 791 paired videos of individuals and clothing images, with 661 for training and 130 for testing, all at 256x192 resolution. ... It comprises 693 training videos and 124 testing videos at 540x720 resolution, totaling 232,843 frames for training and 39,705 frames for testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It discusses computational cost and inference speed but does not specify the underlying hardware.
Software Dependencies | No | The paper mentions several software components and models used or adapted, such as 'Stable Diffusion (Rombach et al. 2022)', 'DW-Pose (Yang et al. 2023)', 'AnimateAnyone (Hu et al. 2023)', and 'AnimateDiff (Guo et al. 2023)'. However, it does not specify version numbers for these or other ancillary software libraries or frameworks, which are necessary for reproducibility.
Experiment Setup | Yes | Implementation details: The training process is divided into two stages. In the first stage, we focus on inpainting and preserving detailed garment textures using the VITON-HD dataset (Choi et al. 2021). We fine-tune the Garment UNet, Pose Encoder, and Main UNet decoder, initializing the Main UNet and Garment UNet with pretrained weights from SD 1.5, while keeping the VAE Encoder, Decoder, and CLIP image encoder weights unchanged. In the second stage, we incorporate temporal attention layers into the previously trained model, initializing these new modules with pretrained weights from AnimateDiff (Guo et al. 2023). ... Impact of Inference Video Chunk Length is examined in Tab. 7. The study reveals that matching the training and inference video chunk lengths (both set to N = 16) yields the best results.
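The two-stage schedule quoted above can be summarized as a freezing plan: which components receive gradients in each stage. The sketch below is an illustrative assumption, not the authors' code; the module names mirror the components the paper lists, and the choice to train only the new temporal layers in stage 2 is a common AnimateDiff-style convention that the paper does not state explicitly.

```python
# Hedged sketch of the two-stage training schedule described in the
# Experiment Setup row. Module names follow the paper's component list;
# the dict structure and helper are illustrative assumptions.

STAGE_1 = {  # image-level stage on VITON-HD: inpainting + garment texture
    "garment_unet": True,          # fine-tuned (initialized from SD 1.5)
    "pose_encoder": True,          # fine-tuned
    "main_unet_decoder": True,     # fine-tuned (Main UNet init from SD 1.5)
    "vae_encoder": False,          # frozen
    "vae_decoder": False,          # frozen
    "clip_image_encoder": False,   # frozen
}

STAGE_2 = {  # video stage: temporal attention layers are added
    **{name: False for name in STAGE_1},  # assumption: prior modules frozen
    "temporal_attention": True,    # new modules, init from AnimateDiff
}

def trainable_modules(stage: dict) -> list:
    """Return the sorted names of modules whose parameters receive gradients."""
    return sorted(name for name, trains in stage.items() if trains)
```

In a real training loop this plan would translate into setting `requires_grad` per parameter group before building the optimizer, one stage at a time.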