reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

Authors: Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, Xiaodan Liang

AAAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct comprehensive experiments on both 768×512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities of garment-centric human generation. Extensive experiments on open and internal benchmarks of 768×512 resolution verify the superiority of DreamFit, demonstrating state-of-the-art performance and robust generalization in diverse human generation tasks.
Researcher Affiliation	Collaboration	(1) Shenzhen International Graduate School, Tsinghua University; (2) Shenzhen Campus of Sun Yat-sen University; (3) ByteDance; (4) Sun Yat-sen University
Pseudocode	No	The paper describes the methodology using textual explanations and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide an explicit statement about releasing source code, nor does it include a link to a code repository. It mentions "DreamFit is engineered for smooth integration with any community control plugins for diffusion models" and refers to using "released pretrained models" for baselines, but not its own implementation.
Open Datasets	Yes	The open benchmark is constructed using a subset of VITON-HD (Choi et al. 2021) and Dress Code (Morelli et al. 2022) test sets.
Dataset Splits	No	To train DreamFit, we collected approximately 500,000 garment-person image pairs from the internet and captioned them using large multi-modal models. For model evaluation, we introduce two garment-centric human generation benchmarks derived from public datasets and the Internet. The open benchmark is constructed using a subset of VITON-HD (Choi et al. 2021) and Dress Code (Morelli et al. 2022) test sets. Specifically, we handpicked 200 diverse garments from these datasets encompassing various styles, colors, shapes, and textures.
Hardware Specification	Yes	The training was conducted on 8 A800 (40G) GPUs for 90k steps, with a batch size of 4 per GPU. To validate scalability, we also initialized the denoising UNet as SDXL and trained the model on 8 A100 (80G) GPUs for 90k steps with the same batch size.
Software Dependencies	No	The paper mentions using specific models and optimizers (e.g., CLIP ViT-L/14, AdamW optimizer, CogVLM, DDIM sampler) but does not provide specific version numbers for any software libraries, programming languages, or development environments used.
Experiment Setup	Yes	The denoising UNet is initialized with the weights of SD1.5 and we use CLIP ViT-L/14 (Radford et al. 2021) as the text encoder. Our model was trained on paired images with a resolution of 768×512. We initialized the LoRA layers in the same manner as described in (Hu et al. 2021), with the LoRA rank set to 64. The training was conducted on 8 A800 (40G) GPUs for 90k steps, with a batch size of 4 per GPU. We utilized the AdamW optimizer with a fixed learning rate of 1e-4. During inference, we use CogVLM (Wang et al. 2023) to refine the user input text. We use DDIM (Song, Meng, and Ermon 2020) sampler with 50 steps and set guidance scale w to 7.5.
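As a rough check on the training budget in the Hardware Specification and Experiment Setup rows, the sketch below turns the stated figures (8 GPUs, 4 images per GPU, 90k steps, approximately 500,000 training pairs) into a global batch size and an approximate number of passes over the data. It assumes plain data-parallel training with no gradient accumulation, which the paper excerpt does not state; the helper names are hypothetical.

```python
# Figures from the table above; the pair count is the paper's approximate figure.
NUM_GPUS = 8
BATCH_PER_GPU = 4
TRAIN_STEPS = 90_000
APPROX_PAIRS = 500_000


def global_batch(num_gpus: int, batch_per_gpu: int, grad_accum: int = 1) -> int:
    """Images per optimizer step under data parallelism (grad_accum=1 is an assumption)."""
    return num_gpus * batch_per_gpu * grad_accum


def samples_seen(steps: int, batch: int) -> int:
    """Total training examples drawn over the run."""
    return steps * batch


def approx_epochs(samples: int, dataset_size: int) -> float:
    """Approximate number of passes over the training pairs."""
    return samples / dataset_size


if __name__ == "__main__":
    batch = global_batch(NUM_GPUS, BATCH_PER_GPU)
    seen = samples_seen(TRAIN_STEPS, batch)
    print(f"global batch size: {batch}")  # 32
    print(f"samples seen: {seen:,}")  # 2,880,000
    print(f"approx. epochs: {approx_epochs(seen, APPROX_PAIRS):.2f}")  # 5.76
```

Under these assumptions the SD1.5 run sees about 2.88 million samples, roughly 5.8 passes over the collected pairs. The same arithmetic applies to the SDXL run, which used the same GPU count, per-GPU batch size, and step count; gradient accumulation, if it was used, would scale both figures.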
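The Experiment Setup row states that the LoRA layers are initialized as in Hu et al. (2021) with rank 64. A minimal pure-Python sketch of that scheme follows: A is drawn from a Gaussian and B starts at zero, so the low-rank update BA contributes nothing at step 0 and the adapted layer initially reproduces the frozen weight. The class name `LoRALinear`, the standard deviation 1/rank, and the `alpha` default (alpha = rank, giving a scale of 1) are illustrative assumptions, not DreamFit's implementation.

```python
import random
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Hypothetical layer: frozen weight W0 plus a trainable update scale * (B @ A)."""

    def __init__(self, w0: Matrix, rank: int = 64,
                 alpha: Optional[float] = None, seed: int = 0) -> None:
        d_out, d_in = len(w0), len(w0[0])
        rng = random.Random(seed)
        self.w0 = w0  # frozen pretrained weight, shape (d_out, d_in)
        self.rank = rank
        # Assumption: alpha defaults to rank, so the update is scaled by 1.
        self.scale = (alpha if alpha is not None else float(rank)) / rank
        # A ~ N(0, (1/rank)^2), shape (rank, d_in); the std is an assumed choice.
        self.a = [[rng.gauss(0.0, 1.0 / rank) for _ in range(d_in)]
                  for _ in range(rank)]
        # B = 0, shape (d_out, rank), so B @ A is exactly zero at initialization.
        self.b = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: Vector) -> Vector:
        """y = W0 x + scale * B (A x); the low-rank path never forms B @ A."""
        base = matvec(self.w0, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]


if __name__ == "__main__":
    rng = random.Random(1)
    w0 = [[rng.uniform(-1.0, 1.0) for _ in range(128)] for _ in range(128)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(128)]
    layer = LoRALinear(w0, rank=64)
    # At initialization the adapted layer matches the frozen layer exactly.
    print(layer.forward(x) == matvec(w0, x))  # True
```

Starting B at zero is what lets fine-tuning begin from the pretrained SD1.5 behaviour: only gradient updates to A and B move the output away from W0 x.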
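For inference, the Experiment Setup row lists a DDIM sampler with 50 steps and a guidance scale w = 7.5. The sketch below assumes the standard classifier-free guidance combination eps = eps_uncond + w * (eps_cond - eps_uncond), deterministic DDIM updates (eta = 0), evenly spaced timesteps over 1,000 training steps, and the scaled-linear beta schedule (0.00085 to 0.012) commonly distributed with SD1.5. The noise predictor `eps_model` is a hypothetical stand-in for the denoising UNet, and none of the function names come from the paper.

```python
import math
from typing import Callable, List

Vector = List[float]
# Hypothetical predictor signature: (x_t, timestep, conditional?) -> predicted noise.
EpsModel = Callable[[Vector, int, bool], Vector]


def scaled_linear_alpha_bars(n: int = 1000, beta_start: float = 0.00085,
                             beta_end: float = 0.012) -> List[float]:
    """Cumulative products alpha_bar_t for a scaled-linear beta schedule."""
    s, e = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_bars, prod = [], 1.0
    for i in range(n):
        beta = (s + (e - s) * i / (n - 1)) ** 2
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def ddim_timesteps(num_steps: int = 50, num_train_steps: int = 1000) -> List[int]:
    """Evenly spaced timesteps, noisiest first: 980, 960, ..., 0 for the defaults."""
    stride = num_train_steps // num_steps
    return list(range(0, num_train_steps, stride))[::-1]


def cfg(eps_uncond: Vector, eps_cond: Vector, w: float = 7.5) -> Vector:
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ddim_step(x_t: Vector, eps: Vector, ab_t: float, ab_prev: float) -> Vector:
    """Deterministic DDIM update (eta = 0) from alpha_bar_t to alpha_bar_prev."""
    x0 = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for x, e in zip(x_t, eps)]
    return [math.sqrt(ab_prev) * x0_i + math.sqrt(1.0 - ab_prev) * e
            for x0_i, e in zip(x0, eps)]


def sample(eps_model: EpsModel, x_T: Vector, w: float = 7.5, steps: int = 50) -> Vector:
    """Run the guided DDIM loop from pure noise x_T."""
    alpha_bars = scaled_linear_alpha_bars()
    ts = ddim_timesteps(steps, len(alpha_bars))
    x = x_T
    for i, t in enumerate(ts):
        # Assumption: the final step jumps to alpha_bar = 1, i.e. the clean estimate.
        ab_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps = cfg(eps_model(x, t, False), eps_model(x, t, True), w)
        x = ddim_step(x, eps, alpha_bars[t], ab_prev)
    return x


def toy_model(x: Vector, t: int, cond: bool) -> Vector:
    """Illustrative predictor only; a real sampler would call the UNet here."""
    return [0.1 * v for v in x] if cond else [0.0 for _ in x]


if __name__ == "__main__":
    print(sample(toy_model, [1.0, -0.5]))
```

Each of the 50 steps needs two predictor calls (conditional and unconditional), which is why guided sampling costs roughly twice as much as unguided sampling at the same step count.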