Normalizing Flows are Capable Generative Models

Authors: Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Ángel Bautista, Navdeep Jaitly, Joshua M. Susskind

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | (Section 3, Experiments) "We perform our experiments on unconditional ImageNet 64x64 (van den Oord et al., 2016b), as well as class-conditional ImageNet 64x64, ImageNet 128x128 (Deng et al., 2009) and AFHQ 256x256 (Choi et al., 2020)." |
| Researcher Affiliation | Industry | "Apple. Correspondence to: Shuangfei Zhai <EMAIL>." |
| Pseudocode | No | The paper describes mathematical formulations and steps in prose (e.g., Equations 3, 4, and 8), but contains no clearly labeled "Pseudocode" or "Algorithm" block, nor a structured procedure formatted like code. |
| Open Source Code | Yes | "We make our code available at https://github.com/apple/mltarflow." |
| Open Datasets | Yes | "We perform our experiments on unconditional ImageNet 64x64 (van den Oord et al., 2016b), as well as class-conditional ImageNet 64x64, ImageNet 128x128 (Deng et al., 2009) and AFHQ 256x256 (Choi et al., 2020)." |
| Dataset Splits | Yes | "We perform our experiments on unconditional ImageNet 64x64 (van den Oord et al., 2016b), as well as class-conditional ImageNet 64x64, ImageNet 128x128 (Deng et al., 2009) and AFHQ 256x256 (Choi et al., 2020). For each setting, we randomly generate 50K samples, and compare it with the statistics from the entire training set." |
| Hardware Specification | Yes | "Our models are implemented with PyTorch, and our experiments are conducted on A100 GPUs." |
| Software Dependencies | No | "Our models are implemented with PyTorch, and our experiments are conducted on A100 GPUs. We by default cast the model to bfloat16, which provides significant memory savings, with the exception of the likelihood task where we found that float32 is necessary to avoid numerical issues." This only mentions "PyTorch" without a specific version number. |
| Experiment Setup | Yes | "All parameters are trained end-to-end with the AdamW optimizer with momentum (0.9, 0.95). We use a cosine learning rate schedule, where the learning rate is warmed up from 10^-6 to 10^-4 for one epoch, then decayed to 10^-6. We use a small weight decay of 10^-4 to stabilize training. We adopt a simple data preprocessing protocol, where we center crop images and linearly rescale the pixels to [-1, 1]." See also Table 7, "Hyperparameters for the best performing model on each task." |
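The quoted optimizer, learning-rate schedule, and pixel rescaling can be sketched as follows. This is a minimal illustration, not the authors' released training code: the AdamW hyperparameters and learning-rate range come from the paper's text, while the step counts are placeholder assumptions (the paper specifies "one epoch" of warmup, which does not map to a fixed step count here).

```python
import math

# AdamW settings quoted in the paper.
ADAMW_CONFIG = dict(
    lr=1e-4,            # peak learning rate
    betas=(0.9, 0.95),  # momentum terms
    weight_decay=1e-4,  # small weight decay to stabilize training
)

def lr_at(step, warmup_steps=1_000, total_steps=100_000,
          base_lr=1e-4, min_lr=1e-6):
    """Linear warmup from min_lr to base_lr, then cosine decay back to min_lr.

    warmup_steps and total_steps are placeholder values, not from the paper.
    """
    if step < warmup_steps:
        return min_lr + (base_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def rescale(pixel):
    """Map a uint8 pixel value in [0, 255] linearly to [-1, 1]."""
    return pixel / 127.5 - 1.0
```

In a PyTorch training loop, `ADAMW_CONFIG` would typically be unpacked into `torch.optim.AdamW` and `lr_at` applied per step; the released code may implement the schedule differently.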