Inductive Moment Matching

Authors: Linqi Zhou, Stefano Ermon, Jiaming Song

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We evaluate IMM's empirical performance (Section 7.1), training stability (Section 7.2), sampling choices (Section 7.3), scaling behavior (Section 7.4) and ablate our practical decisions (Section 7.5). We present FID (Heusel et al., 2017) results for unconditional CIFAR-10 and class-conditional ImageNet-256×256 in Table 1 and 2." |
| Researcher Affiliation | Collaboration | "¹Luma AI, ²Stanford University. Correspondence to: Linqi Zhou <EMAIL>." |
| Pseudocode | Yes | "Algorithm 1 Training (see Appendix D for full version) [...] Algorithm 2 Pushforward Sampling (details in Appendix F)" |
| Open Source Code | No | The paper contains no explicit statement about releasing source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | "Generated samples on ImageNet-256×256 using 8 steps. [...] On CIFAR-10, IMM similarly achieves a state-of-the-art 1.98 FID with 2-step generation for a model trained from scratch." |
| Dataset Splits | Yes | "We present FID (Heusel et al., 2017) results for unconditional CIFAR-10 and class-conditional ImageNet-256×256 in Table 1 and 2." The paper reports FID-50K, implying the use of standard evaluation protocols and established dataset splits for these benchmarks. |
| Hardware Specification | No | The paper states "Model GFLOPs. We reuse numbers from DiT (Peebles & Xie, 2023) for each model architecture," which describes model complexity, but it does not provide specific hardware details (e.g., GPU models, CPU types) used to run the experiments. |
| Software Dependencies | No | The paper cites architectural references such as DiT (Peebles & Xie, 2023), EDM (Karras et al., 2022), and the Stable Diffusion VAE, but it does not provide version numbers for any software libraries, programming languages, or tools used in the implementation. |
| Experiment Setup | Yes | "We summarize our best runs in Table 5. Specifically, for ImageNet-256×256, we adopt a latent-space paradigm for computational efficiency. For its autoencoder, we follow EDM2 (Karras et al., 2024) and pre-encode all images from ImageNet into latents without flipping, and calculate the channel-wise mean and std for normalization. We use the Stable Diffusion VAE and rescale the latents by channel mean [0.86488, 0.27787343, 0.21616915, 0.3738409] and channel std [4.85503674, 5.31922414, 3.93725398, 3.9870003]. After this normalization transformation, we further multiply the latents by 0.5 so that the latents roughly have std 0.5. For DiT architectures of different sizes, we use the same hyperparameters for all experiments." Table 5 details training and parameterization settings such as c_noise(t) = 1000t, flow trajectory OT-FM, g_θ(x_t, s, t) parameterized as Simple-EDM / Euler-FM, σ_d = 0.5, training iterations 400K / 1.2M, batch size 4096, learning rate 0.0001, AdamW optimizer, a Laplace kernel, and other specific settings. |
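The latent-normalization recipe quoted above (channel-wise standardization of Stable Diffusion VAE latents, then a ×0.5 rescale so the latents have std ≈ 0.5) can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: the function name `normalize_latents` and the assumed `(N, 4, H, W)` latent layout are assumptions; only the channel statistics and the ×0.5 factor come from the paper.

```python
import numpy as np

# Channel-wise statistics for SD-VAE latents, as reported in the paper.
CHANNEL_MEAN = np.array([0.86488, 0.27787343, 0.21616915, 0.3738409])
CHANNEL_STD = np.array([4.85503674, 5.31922414, 3.93725398, 3.9870003])


def normalize_latents(latents: np.ndarray) -> np.ndarray:
    """Standardize SD-VAE latents channel-wise, then scale to std ~0.5.

    Expects latents of shape (N, 4, H, W); name and layout are
    illustrative assumptions, not from the paper.
    """
    mean = CHANNEL_MEAN.reshape(1, -1, 1, 1)
    std = CHANNEL_STD.reshape(1, -1, 1, 1)
    standardized = (latents - mean) / std  # per-channel std ~1
    return standardized * 0.5              # per-channel std ~0.5
```

Pre-encoding the dataset once and storing normalized latents (as EDM2 does) keeps the per-iteration cost low, since the VAE encoder never runs inside the training loop.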