Jet: A Modern Transformer-Based Normalizing Flow

Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture, not convolutional neural networks. As a result, we achieve a much simpler architecture that matches existing normalizing flow models and improves over them when paired with pretraining. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advancing the research frontier by serving as building components of more powerful generative models." Also, from Section 3 (Experiments): "Throughout the paper, we keep our experimental setup simple and unified."
Researcher Affiliation | Industry | Alexander Kolesnikov, EMAIL, Google DeepMind; André Susano Pinto, EMAIL, Google DeepMind; Michael Tschannen, EMAIL, Google DeepMind
Pseudocode | No | The paper describes mathematical formulations for the coupling blocks and the overall model structure, but does not include any explicitly labeled pseudocode or algorithm blocks. For example, a single coupling block is formalized as y1 = x1, y2 = (x2 + b(x1)) ⊙ σ(s(x1)), which is an equation, not pseudocode.
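The coupling equation quoted above can be turned into a short runnable sketch. This is an illustrative reconstruction, not the paper's code: the shift network b and scale network s are passed in as generic callables (the paper implements them with ViT blocks), and σ denotes the logistic sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coupling_forward(x1, x2, s, b):
    """One affine coupling block: x1 passes through unchanged,
    x2 is shifted by b(x1) and scaled elementwise by sigmoid(s(x1)).
    s and b are arbitrary functions of x1 (ViT blocks in the paper)."""
    y1 = x1
    scale = sigmoid(s(x1))
    y2 = (x2 + b(x1)) * scale
    # Log-determinant of the Jacobian: sum of the elementwise log-scales.
    logdet = np.sum(np.log(scale))
    return y1, y2, logdet

def coupling_inverse(y1, y2, s, b):
    """Exact inverse: since y1 == x1, the scale and shift can be recomputed."""
    x1 = y1
    x2 = y2 / sigmoid(s(y1)) - b(y1)
    return x1, x2
```

The invertibility relies only on x1 being passed through unchanged, which is why normalizing flows can use arbitrarily complex networks for s and b.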
Open Source Code Yes Throughout the paper, we keep our experimental setup simple and unified. Additionally, the code is available in the big_vision codebase1. 1https://github.com/google-research/big_vision
Open Datasets | Yes | "Datasets. We perform experiments on three datasets: Imagenet-1k, Imagenet-21k and CIFAR-10, across two input resolutions: 32×32 and 64×64 (except for CIFAR-10). To downsample Imagenet-1k images we follow the standard protocol (Van den Oord et al., 2016b) to ensure a correct comparison to the prior art (i.e. we use the preprocessed data provided by (Van den Oord et al., 2016b) where available). For CIFAR-10 we use the original dataset resolution."
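For readers reproducing the small-resolution setup, integer-factor box averaging is a minimal stand-in for this kind of downsampling; the exact filter in the reference pipeline (Van den Oord et al., 2016b) may differ, and the paper uses the pre-resized data where available.

```python
import numpy as np

def box_downsample(img, factor):
    """Downsample an HxWxC image by an integer factor via box averaging.
    Illustrative only: a simple approximation of small-ImageNet-style
    preprocessing, not the exact reference protocol."""
    h, w, c = img.shape
    assert h % factor == 0 and w % factor == 0, "size must divide evenly"
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
```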
Dataset Splits | No | The paper mentions using ImageNet-1k, ImageNet-21k, and CIFAR-10 and refers to a "validation NLL", implying validation sets are used. However, it does not explicitly state percentages, sample counts, or how these datasets were split into training, validation, and test sets. It mentions following a "standard protocol" for downsampling ImageNet-1k but does not elaborate on split details.
Hardware Specification | Yes | "Figure 2: Effect of different architecture design choices on the validation NLL (in bits per dimension), as a function of training compute. Figure 2a: Results on ImageNet-1k 64×64 for CNN vs ViT blocks (the marker size is proportional to the model parameter count). ViT blocks clearly outperform CNN blocks for a given training compute budget. Figure 2b: Results on ImageNet-21k 32×32 for different ViT depths. Increasing the block depth leads to improved results up to depth 5." (The x-axis label "TPU Core-hours" in Figures 2 and 3 indicates the use of TPUs for computation.)
Software Dependencies | No | The paper mentions using AdamW for the optimizer and TensorFlow for image resizing operations, and refers to the big_vision codebase. However, it does not provide specific version numbers for any of these software components (e.g., "TensorFlow 2.x" or "big_vision 1.0").
Experiment Setup | Yes | "For the optimizer we use AdamW (Loshchilov et al., 2017). We set the second-momentum parameter β2 to 0.95 to stabilize training. We use a cosine learning rate decay schedule. We use a fixed standard learning rate of 3e-4, weight decay of 1e-5 and train for 200 epochs for ImageNet-1k and for 50 epochs for ImageNet-21k. We additionally investigate a transfer learning setup and finetune our best ImageNet-21k models on ImageNet-1k and CIFAR-10." Appendix A.1 (Architecture details), Table 3, lists the architectures behind the main results in Table 1, specifying coupling layers, ViT depth, ViT width, and ViT attention heads for the different datasets and resolutions.
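The reported hyperparameters pin down the learning-rate trajectory almost completely. A hedged sketch of the cosine decay schedule (from 3e-4 to zero; any warmup is not specified in the excerpt, so none is shown):

```python
import math

# Hyperparameters as reported in the excerpt:
# AdamW with beta2 = 0.95, weight decay = 1e-5, base LR = 3e-4.
BASE_LR = 3e-4
BETA2 = 0.95
WEIGHT_DECAY = 1e-5

def cosine_lr(step, total_steps, base_lr=BASE_LR):
    """Cosine learning-rate decay from base_lr down to 0.
    total_steps would be 200 epochs' worth of steps for ImageNet-1k
    and 50 epochs' worth for ImageNet-21k."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Frameworks such as optax or PyTorch provide equivalent built-in schedules; the closed form above is just the standard cosine curve the paper names.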