Truncated Consistency Models

Authors: Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, Weili Nie

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CIFAR-10 and ImageNet 64×64 datasets show that our method achieves better one-step and two-step FIDs than state-of-the-art consistency models such as iCT-deep, using more than 2× smaller networks.
Researcher Affiliation | Collaboration | Sangyun Lee (Carnegie Mellon University), Yilun Xu (NVIDIA), Tomas Geffner (NVIDIA), Giulia Fanti (Carnegie Mellon University), Karsten Kreis (NVIDIA), Arash Vahdat (NVIDIA), Weili Nie (NVIDIA)
Pseudocode | Yes | Algorithm 1: Truncated Consistency Training
Open Source Code | Yes | Project page: https://truncated-cm.github.io/
Open Datasets | Yes | We evaluate TCM on the CIFAR-10 (Krizhevsky et al., 2009) and ImageNet 64×64 (Deng et al., 2009) datasets. To show the scalability of our method, we train TCM on the COYO dataset (https://github.com/kakaobrain/coyo-dataset), using consistency distillation with a fixed classifier-free guidance (Ho & Salimans, 2022) scale of 6. We initialize our models with Stable Diffusion 1.5 (Rombach et al., 2022). We use a batch size of 512 for a quick validation, though using a larger batch size (1,024) is standard (Liu et al., 2023; Yin et al., 2024a) and would lead to better generative performance. For the first stage, we train for 80,000 iterations (after which FID starts to increase), and in the second stage, we additionally train for another 200,000 iterations. We provide a visual comparison between the standard consistency model and TCM in Fig. 6. Captions used are: "A photo of an astronaut riding a horse on Mars", "Robot serving dinner, metallic textures, futuristic atmosphere, high-tech kitchen, elegant plating, intricate details, high quality, misc-architectural style, warm and inviting lighting", and "A photo of a dog" for each row. We also measure the FID on the MSCOCO dataset (Lin et al., 2014) in Table 5. We see that TCM achieves a better FID than the standard consistency model (the first stage).
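The fixed classifier-free guidance scale of 6 quoted above refers to the standard CFG combination from Ho & Salimans (2022): extrapolating from the unconditional toward the conditional prediction. A minimal sketch of that combination (the function name and array shapes are illustrative, not from the paper's code):

```python
import numpy as np

def cfg_combine(pred_uncond: np.ndarray, pred_cond: np.ndarray,
                scale: float = 6.0) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by `scale`."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

# With scale = 6, the guided output moves 6x the cond-uncond gap
# away from the unconditional prediction.
guided = cfg_combine(np.zeros(4), np.ones(4))
print(guided)  # [6. 6. 6. 6.]
```

At scale 1 this reduces to the conditional prediction; scales above 1 trade diversity for prompt adherence, which is why a single fixed scale is distilled into the student here.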
Dataset Splits | No | The paper mentions batch sizes and training iterations (e.g., "On CIFAR-10, we use a batch size of 512 and 1024 for the first and the second stage, respectively.") and refers to standard benchmark datasets, but does not provide specific training/test/validation dataset splits, percentages, or explicit methodologies for data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It mentions model architectures like "EDM2-S" and "EDM2-XL" and discusses "memory cost", but these are not hardware specifications. For example, it states: "We observe that on ImageNet 64×64 with EDM2-S, TCMs have an 18% increase in training time per iteration and a 15% increase in memory cost."
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For consistency training in TCM, we mostly follow the hyperparameters in ECT (Geng et al., 2024), including the discretization curriculum and continuous-time training schedule. For all experiments, we choose a dividing time t = 1 and set ψ_t to the log-Student-t distribution. We use w_b = 0.1 and ρ = 0.25 for the boundary loss. [...] On CIFAR-10, we use a batch size of 512 and 1024 for the first and the second stage, respectively. On ImageNet with the EDM2-S architecture, we use a batch size of 2048 and 1024 for the first and the second stage, respectively. For EDM2-XL, to save compute, we initialize the truncated training stage with the pre-trained checkpoint from the ECM work (Geng et al., 2024) that performs the standard consistency training, and conduct the second-stage training with a batch size of 1024. Training details: We set Δt = (1 + 8·sigmoid(−t)) · (1 − r) · t, where r = max{1 − 2^(−i/25000), 0.999} for CIFAR-10 and max{1 − 4^(−i/25000), 0.9961} for ImageNet 64×64, with i being the training iteration. For CIFAR-10, we train for 250K iterations in Stage 1 and 200K iterations in Stage 2. For ImageNet 64×64, EDM2-S is trained for 150K iterations in Stage 1 and 120K iterations in Stage 2, while EDM2-XL is trained for 40K iterations in Stage 2 only. [...] The weighting function ω(t) is set to 1 for CIFAR-10 and t/c_out(t)^2 for ImageNet 64×64. As suggested by Song & Dhariwal (2023) and Geng et al. (2024), we use the Pseudo-Huber loss function d(x, y) = sqrt(||x − y||_2^2 + c^2) − c, with c = 1e−8 for CIFAR-10 and c = 0.06 for ImageNet 64×64. [...] For ImageNet 64×64, we employ mixed-precision training with dynamic loss scaling and use power-function EMA (Karras et al., 2024) with γ = 6.94 (without post-hoc EMA search). Learning rate schedules: EDM2 (Karras et al., 2024) architectures require a manual decay of the learning rate. Karras et al. (2024) suggest the inverse square root schedule α_ref / sqrt(max(t/t_ref, 1)). For the first-stage training of EDM2-S on ImageNet, we use t_ref = 2000 and α_ref = 1e−3 following Geng et al. (2024). For the second-stage training of EDM2-S, we use t_ref = 8000 and α_ref = 5e−4. Second-stage training of EDM2-XL is initialized with the ECM2-XL checkpoint from Geng et al. (2024). During the second stage, we use t_ref = 8000 and α_ref = 1e−4 for EDM2-XL.
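The schedules and loss quoted in the experiment-setup row can be collected into a short, self-contained sketch. The exact placement of the exponent in the r curriculum and the use of max{·} follow our reading of the extracted text, so treat the constants and formula shapes as illustrative rather than the authors' implementation:

```python
import math

def r_schedule(i: int, base: float = 2.0, floor_r: float = 0.999,
               d: int = 25000) -> float:
    """Discretization curriculum r (our reading of the extracted formula).
    base=2, floor_r=0.999 for CIFAR-10; base=4, floor_r=0.9961 for ImageNet 64x64."""
    return max(1.0 - base ** (-i / d), floor_r)

def delta_t(t: float, i: int, **kw) -> float:
    """Step size Delta-t = (1 + 8*sigmoid(-t)) * (1 - r) * t."""
    sigmoid_neg_t = 1.0 / (1.0 + math.exp(t))  # sigmoid(-t)
    return (1.0 + 8.0 * sigmoid_neg_t) * (1.0 - r_schedule(i, **kw)) * t

def pseudo_huber(x, y, c: float = 0.06) -> float:
    """Pseudo-Huber distance d(x, y) = sqrt(||x - y||_2^2 + c^2) - c.
    c = 1e-8 for CIFAR-10, c = 0.06 for ImageNet 64x64."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(sq + c * c) - c

def lr(step: int, alpha_ref: float = 5e-4, t_ref: int = 8000) -> float:
    """EDM2 inverse-square-root decay: alpha_ref / sqrt(max(step/t_ref, 1))."""
    return alpha_ref / math.sqrt(max(step / t_ref, 1.0))

# The learning rate stays at alpha_ref until t_ref, then decays:
# lr(0) == 5e-4, and lr(32000) == 5e-4 / sqrt(4) == 2.5e-4.
```

With a tiny c (the CIFAR-10 setting), the Pseudo-Huber loss is essentially the L2 distance; the larger c = 0.06 on ImageNet 64×64 softens the penalty on small residuals.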