Jet: A Modern Transformer-Based Normalizing Flow

Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture, not convolutional neural networks. As a result, we achieve a much simpler architecture that matches existing normalizing flow models and improves over them when paired with pretraining. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advancing the research frontier by serving as building components of more powerful generative models." Also, from Section 3 (Experiments): "Throughout the paper, we keep our experimental setup simple and unified."
Researcher Affiliation | Industry | Alexander Kolesnikov, EMAIL, Google DeepMind; André Susano Pinto, EMAIL, Google DeepMind; Michael Tschannen, EMAIL, Google DeepMind
Pseudocode | No | The paper describes mathematical formulations for the coupling blocks and the overall model structure, but does not include any explicitly labeled pseudocode or algorithm blocks. For example, a single coupling block is formalized as y1 = x1, y2 = (x2 + b(x1)) ⊙ σ(s(x1)), which is an equation, not pseudocode.
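The coupling equation quoted above can be turned into a short runnable sketch. This is an illustrative reconstruction, not the paper's code: the shift network b and scale network s are passed in as generic callables (the paper implements them with ViT blocks), and σ denotes the logistic sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coupling_forward(x1, x2, s, b):
    """One affine coupling block: x1 passes through unchanged,
    x2 is shifted by b(x1) and scaled elementwise by sigmoid(s(x1)).
    s and b are arbitrary functions of x1 (ViT blocks in the paper)."""
    y1 = x1
    scale = sigmoid(s(x1))
    y2 = (x2 + b(x1)) * scale
    # Log-determinant of the Jacobian: sum of the elementwise log-scales.
    logdet = np.sum(np.log(scale))
    return y1, y2, logdet

def coupling_inverse(y1, y2, s, b):
    """Exact inverse: since y1 == x1, the scale and shift can be recomputed."""
    x1 = y1
    x2 = y2 / sigmoid(s(y1)) - b(y1)
    return x1, x2
```

The invertibility relies only on x1 being passed through unchanged, which is why normalizing flows can use arbitrarily complex networks for s and b.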
Open Source Code Yes Throughout the paper, we keep our experimental setup simple and unified. Additionally, the code is available in the big_vision codebase1. 1https://github.com/google-research/big_vision
Open Datasets | Yes | "Datasets. We perform experiments on three datasets: Imagenet-1k, Imagenet-21k and CIFAR-10, across two input resolutions: 32×32 and 64×64 (except for CIFAR-10). To downsample Imagenet-1k images we follow the standard protocol (Van den Oord et al., 2016b) to ensure a correct comparison to the prior art (i.e. we use the preprocessed data provided by (Van den Oord et al., 2016b) where available). For CIFAR-10 we use the original dataset resolution."
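For readers reproducing the small-resolution setup, integer-factor box averaging is a minimal stand-in for this kind of downsampling; the exact filter in the reference pipeline (Van den Oord et al., 2016b) may differ, and the paper uses the pre-resized data where available.

```python
import numpy as np

def box_downsample(img, factor):
    """Downsample an HxWxC image by an integer factor via box averaging.
    Illustrative only: a simple approximation of small-ImageNet-style
    preprocessing, not the exact reference protocol."""
    h, w, c = img.shape
    assert h % factor == 0 and w % factor == 0, "size must divide evenly"
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
```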
Dataset Splits | No | The paper mentions using ImageNet-1k, ImageNet-21k, and CIFAR-10 and refers to a "validation NLL", implying validation sets are used. However, it does not explicitly state percentages, sample counts, or how these datasets were split into training, validation, and test sets. It mentions following a "standard protocol" for downsampling ImageNet-1k but does not elaborate on split details.
Hardware Specification | Yes | "Figure 2: Effect of different architecture design choices on the validation NLL (in bits per dimension), as a function of training compute. Figure 2a: Results on ImageNet-1k 64×64 for CNN vs ViT blocks (the marker size is proportional to the model parameter count). ViT blocks clearly outperform CNN blocks for a given training compute budget. Figure 2b: Results on ImageNet-21k 32×32 for different ViT depths. Increasing the block depth leads to improved results up to depth 5." (The x-axis label "TPU Core-hours" in Figures 2 and 3 indicates the use of TPUs for computation.)
Software Dependencies | No | The paper mentions using AdamW for the optimizer and TensorFlow for image resizing operations, and refers to the big_vision codebase. However, it does not provide specific version numbers for any of these software components (e.g., "TensorFlow 2.x" or "big_vision 1.0").
Experiment Setup | Yes | "For the optimizer we use AdamW (Loshchilov et al., 2017). We set the second-momentum parameter β2 to 0.95 to stabilize training. We use a cosine learning rate decay schedule. We use a fixed standard learning rate of 3e-4, weight decay of 1e-5 and train for 200 epochs for ImageNet-1k and for 50 epochs for ImageNet-21k. We additionally investigate a transfer learning setup and finetune our best ImageNet-21k models on ImageNet-1k and CIFAR-10." Appendix A.1 (Architecture details), Table 3, lists the architectures behind the main results in Table 1, specifying coupling layers, ViT depth, ViT width, and ViT attention heads for the different datasets and resolutions.
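The reported hyperparameters pin down the learning-rate trajectory almost completely. A hedged sketch of the cosine decay schedule (from 3e-4 to zero; any warmup is not specified in the excerpt, so none is shown):

```python
import math

# Hyperparameters as reported in the excerpt:
# AdamW with beta2 = 0.95, weight decay = 1e-5, base LR = 3e-4.
BASE_LR = 3e-4
BETA2 = 0.95
WEIGHT_DECAY = 1e-5

def cosine_lr(step, total_steps, base_lr=BASE_LR):
    """Cosine learning-rate decay from base_lr down to 0.
    total_steps would be 200 epochs' worth of steps for ImageNet-1k
    and 50 epochs' worth for ImageNet-21k."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Frameworks such as optax or PyTorch provide equivalent built-in schedules; the closed form above is just the standard cosine curve the paper names.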