Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Shuaiqi Wang, Matthew Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the value of LCSC through two use cases: (a) Reducing training cost. With LCSC, we only need to train DM/CM with fewer iterations and/or smaller batch sizes to obtain sample quality comparable to the fully trained model. For example, LCSC achieves considerable training speedups for CM (23× on CIFAR-10 and 15× on ImageNet-64). (b) Enhancing pre-trained models. When full training is already done, LCSC can further improve the generation quality or efficiency of the final converged models. For example, LCSC achieves better FID using 1 function evaluation (NFE) than the base model with 2 NFE on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality. Applying LCSC to large text-to-image models, we also observe clearly enhanced generation quality. |
| Researcher Affiliation | Collaboration | Enshu Liu1,2, Junyi Zhu3, Zinan Lin4, Xuefei Ning1, Shuaiqi Wang5, Matthew B. Blaschko3, Sergey Yekhanin4, Shengen Yan2, Guohao Dai2,6, Huazhong Yang1, Yu Wang1 (1Tsinghua University, 2Infinigence-AI, 3KU Leuven, 4Microsoft Research, 5Carnegie Mellon University, 6Shanghai Jiao Tong University) |
| Pseudocode | Yes | Algorithm 1 Evolutionary Search for Combination Coefficients Optimization |
| Open Source Code | No | The paper refers to existing codebases like DDIM (https://github.com/ermongroup/ddim), iDDPM (https://github.com/openai/improved-diffusion), DPM-Solver (https://github.com/LuChengTHU/dpm-solver), and the official CM code (https://github.com/openai/consistency_models_cifar10) that were used in their experiments. However, it does not provide an explicit statement or link for the open-sourcing of their own method, LCSC. |
| Open Datasets | Yes | For DM, we follow DDIM (Song et al., 2020a) for the evaluation on CIFAR-10 (Krizhevsky et al., 2009) and iDDPM (Nichol & Dhariwal, 2021) for the evaluation on ImageNet-64 (Deng et al., 2009). For CM (Song et al., 2023), we evaluate LCSC with both CD and CT on CIFAR-10 and ImageNet-64, and CT on LSUN datasets. For the text-to-image task, we fine-tune a LoRA based on the Stable Diffusion v1-5 model (Rombach et al., 2022) on the CC12M dataset and use 1k extra image-text pairs in this dataset to conduct the search. We test the search result using 10k data points in the MS-COCO dataset with FID, PickScore, and ImageReward. |
| Dataset Splits | No | The paper mentions using well-known datasets like CIFAR-10, ImageNet-64, LSUN, CC12M, and MS-COCO, which typically have predefined splits. It also mentions computing FID using the 'training set as the ground truth' and 'additional metrics based on the test set'. However, the paper does not explicitly state the specific percentages, sample counts, or a citation to the exact split methodology used for its experiments, which is necessary for reproducibility of the data partitioning. |
| Hardware Specification | Yes | All experiments are performed on a single NVIDIA A100 GPU, paired with an Intel Xeon Platinum 8385P CPU. |
| Software Dependencies | No | The paper mentions using PyTorch for its experiments and refers to several third-party codebases (DDIM, iDDPM, DPM-Solver, official CM code). However, it does not provide specific version numbers for PyTorch or any other software dependencies, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | On ImageNet-64, we decrease the batch size to 256 on CM and 512 on DM due to limited resources. For the text-to-image task, we fine-tune a LoRA based on the Stable Diffusion v1-5 model (Rombach et al., 2022) on the CC12M dataset. We use a batch size of 12 and train the LoRA for 6k steps. For CM, checkpoints have a window size of 40K with an interval of 100 on CIFAR-10, and a window size of 20K with an interval of 100 on ImageNet-64 and LSUN datasets. DM is assigned a window size of 50K and an interval of 200 for both datasets. An evolutionary search spanning 2K iterations is applied consistently across all experimental setups. |
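The Pseudocode row above points to the paper's Algorithm 1 ("Evolutionary Search for Combination Coefficients Optimization"), which is not reproduced in this review. As a rough illustration only, the NumPy sketch below shows what an evolutionary search over checkpoint-combination coefficients might look like: the function names (`merge_checkpoints`, `evolutionary_search`), the mutate-and-replace-worst scheme, and the toy fitness function are all assumptions for illustration, not the paper's actual algorithm (which searches coefficients against a generation-quality metric such as FID).

```python
import numpy as np

def merge_checkpoints(checkpoints, coeffs):
    """Linearly combine checkpoint weight vectors with the given coefficients."""
    return sum(c * w for c, w in zip(coeffs, checkpoints))

def evolutionary_search(checkpoints, fitness_fn, pop_size=16, iters=200,
                        sigma=0.05, seed=0):
    """Toy evolutionary search over combination coefficients.

    fitness_fn scores a merged weight vector; lower is better
    (standing in for a metric like FID in the real setting).
    """
    rng = np.random.default_rng(seed)
    n = len(checkpoints)
    # Initialize around uniform averaging (EMA/SWA-like) plus small noise.
    population = [np.full(n, 1.0 / n) + sigma * rng.standard_normal(n)
                  for _ in range(pop_size)]

    def score(coeffs):
        return fitness_fn(merge_checkpoints(checkpoints, coeffs))

    scores = [score(c) for c in population]
    for _ in range(iters):
        # Pick a parent from the better half, mutate its coefficients.
        order = np.argsort(scores)
        parent = population[order[rng.integers(pop_size // 2)]]
        child = parent + sigma * rng.standard_normal(n)
        child_score = score(child)
        # Replace the current worst member if the child improves on it.
        worst = order[-1]
        if child_score < scores[worst]:
            population[worst], scores[worst] = child, child_score
    best = int(np.argmin(scores))
    return population[best], scores[best]
```

In this sketch the coefficients are unconstrained; the real method may normalize them or restrict the search space, and evaluating `fitness_fn` on a generative model is by far the dominant cost, which is why the paper bounds the search at 2K iterations.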