DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder
Authors: Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, Xiaodan Liang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments on both 768 × 512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities of garment-centric human generation. Extensive experiments on open and internal benchmarks of 768 × 512 resolution verify the superiority of DreamFit, demonstrating state-of-the-art performance and robust generalization in diverse human generation tasks. |
| Researcher Affiliation | Collaboration | (1) Shenzhen International Graduate School, Tsinghua University; (2) Shenzhen Campus of Sun Yat-sen University; (3) ByteDance; (4) Sun Yat-sen University |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor does it include a link to a code repository. It mentions "DreamFit is engineered for smooth integration with any community control plugins for diffusion models" and refers to using "released pretrained models" for baselines, but not its own implementation. |
| Open Datasets | Yes | The open benchmark is constructed using a subset of VITON-HD (Choi et al. 2021) and Dress Code (Morelli et al. 2022) test sets. |
| Dataset Splits | No | To train DreamFit, we collected approximately 500,000 garment-person image pairs from the internet and captioned them using large multi-modal models. For model evaluation, we introduce two garment-centric human generation benchmarks derived from public datasets and the Internet. The open benchmark is constructed using a subset of VITON-HD (Choi et al. 2021) and Dress Code (Morelli et al. 2022) test sets. Specifically, we handpicked 200 diverse garments from these datasets encompassing various styles, colors, shapes, and textures. |
| Hardware Specification | Yes | The training was conducted on 8 A800 (40G) GPUs for 90k steps, with a batch size of 4 per GPU. To validate scalability, we also initialized the denoising UNet as SDXL and trained the model on 8 A100 (80G) GPUs for 90k steps with the same batch size. |
| Software Dependencies | No | The paper mentions using specific models and optimizers (e.g., CLIP ViT-L/14, AdamW optimizer, CogVLM, DDIM sampler) but does not provide specific version numbers for any software libraries, programming languages, or development environments used. |
| Experiment Setup | Yes | The denoising UNet is initialized with the weights of SD1.5 and we use CLIP ViT-L/14 (Radford et al. 2021) as the text encoder. Our model was trained on paired images with a resolution of 768 × 512. We initialized the LoRA layers in the same manner as described in (Hu et al. 2021), with the LoRA rank set to 64. The training was conducted on 8 A800 (40G) GPUs for 90k steps, with a batch size of 4 per GPU. We utilized the AdamW optimizer with a fixed learning rate of 1e-4. During inference, we use CogVLM (Wang et al. 2023) to refine the user input text. We use DDIM (Song, Meng, and Ermon 2020) sampler with 50 steps and set guidance scale w to 7.5. |
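
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which also makes the implied training budget easy to compute. The following is a minimal sketch, not the authors' code: the class name `TrainConfig`, the helper `cfg_combine`, and the field names are hypothetical, the global batch size assumes no gradient accumulation (the source does not say), and the guidance formula is the standard classifier-free guidance combination, which the source does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row (SD1.5 variant).

    Field names are illustrative and do not come from the paper.
    """
    height: int = 768
    width: int = 512
    lora_rank: int = 64
    num_gpus: int = 8
    batch_per_gpu: int = 4
    train_steps: int = 90_000
    learning_rate: float = 1e-4   # fixed learning rate, AdamW
    ddim_steps: int = 50
    guidance_scale: float = 7.5   # the "w" in the source

    @property
    def global_batch(self) -> int:
        # Assumption: no gradient accumulation, so one optimizer step
        # consumes num_gpus * batch_per_gpu image pairs.
        return self.num_gpus * self.batch_per_gpu

    @property
    def samples_seen(self) -> int:
        # Total image pairs processed over the whole run.
        return self.global_batch * self.train_steps


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance (assumed form):
    eps = eps_uncond + w * (eps_cond - eps_uncond), applied elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


cfg = TrainConfig()
# Roughly how many passes over the ~500,000 training pairs the run implies.
approx_epochs = cfg.samples_seen / 500_000
```

Under these assumptions the run uses a global batch of 32 and processes 2,880,000 image pairs, or about 5.76 passes over the roughly 500,000 collected pairs.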