Normalizing Flows are Capable Generative Models

Authors: Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Ángel Bautista, Navdeep Jaitly, Joshua M. Susskind

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | (Section 3, Experiments) "We perform our experiments on unconditional ImageNet 64x64 (van den Oord et al., 2016b), as well as class-conditional ImageNet 64x64, ImageNet 128x128 (Deng et al., 2009) and AFHQ 256x256 (Choi et al., 2020)." |
| Researcher Affiliation | Industry | "Apple. Correspondence to: Shuangfei Zhai <EMAIL>." |
| Pseudocode | No | The paper describes mathematical formulations and steps in prose (e.g., Equations 3, 4, and 8), but contains no clearly labeled "Pseudocode" or "Algorithm" block, nor a structured procedure formatted like code. |
| Open Source Code | Yes | "We make our code available at https://github.com/apple/mltarflow." |
| Open Datasets | Yes | "We perform our experiments on unconditional ImageNet 64x64 (van den Oord et al., 2016b), as well as class-conditional ImageNet 64x64, ImageNet 128x128 (Deng et al., 2009) and AFHQ 256x256 (Choi et al., 2020)." |
| Dataset Splits | Yes | "We perform our experiments on unconditional ImageNet 64x64 (van den Oord et al., 2016b), as well as class-conditional ImageNet 64x64, ImageNet 128x128 (Deng et al., 2009) and AFHQ 256x256 (Choi et al., 2020). For each setting, we randomly generate 50K samples, and compare it with the statistics from the entire training set." |
| Hardware Specification | Yes | "Our models are implemented with PyTorch, and our experiments are conducted on A100 GPUs." |
| Software Dependencies | No | "Our models are implemented with PyTorch, and our experiments are conducted on A100 GPUs. We by default cast the model to bfloat16, which provides significant memory savings, with the exception of the likelihood task where we found that float32 is necessary to avoid numerical issues." This only mentions "PyTorch" without a specific version number. |
| Experiment Setup | Yes | "All parameters are trained end-to-end with the AdamW optimizer with momentum (0.9, 0.95). We use a cosine learning rate schedule, where the learning rate is warmed up from 10^-6 to 10^-4 for one epoch, then decayed to 10^-6. We use a small weight decay of 10^-4 to stabilize training. We adopt a simple data preprocessing protocol, where we center crop images and linearly rescale the pixels to [-1, 1]." See also Table 7, "Hyperparameters for the best performing model on each task." |
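The quoted optimizer, learning-rate schedule, and pixel rescaling can be sketched as follows. This is a minimal illustration, not the authors' released training code: the AdamW hyperparameters and learning-rate range come from the paper's text, while the step counts are placeholder assumptions (the paper specifies "one epoch" of warmup, which does not map to a fixed step count here).

```python
import math

# AdamW settings quoted in the paper.
ADAMW_CONFIG = dict(
    lr=1e-4,            # peak learning rate
    betas=(0.9, 0.95),  # momentum terms
    weight_decay=1e-4,  # small weight decay to stabilize training
)

def lr_at(step, warmup_steps=1_000, total_steps=100_000,
          base_lr=1e-4, min_lr=1e-6):
    """Linear warmup from min_lr to base_lr, then cosine decay back to min_lr.

    warmup_steps and total_steps are placeholder values, not from the paper.
    """
    if step < warmup_steps:
        return min_lr + (base_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def rescale(pixel):
    """Map a uint8 pixel value in [0, 255] linearly to [-1, 1]."""
    return pixel / 127.5 - 1.0
```

In a PyTorch training loop, `ADAMW_CONFIG` would typically be unpacked into `torch.optim.AdamW` and `lr_at` applied per step; the released code may implement the schedule differently.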