JetFormer: An autoregressive generative model of raw images and text
Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on ImageNet class-conditional image generation and on web-scale multimodal generation, thereby demonstrating that JetFormer works and scales to both text-to-image generation and vision-language understanding with a single model. ...Table 1: Comparison of JetFormer trained for 500 epochs on ImageNet256 with baselines from the literature. ...Table 2: Summary of the main JetFormer (trained for 100 epochs) ablations performed on class-conditional ImageNet256 generation. |
| Researcher Affiliation | Industry | Michael Tschannen, André Susano Pinto, Alexander Kolesnikov, Google DeepMind |
| Pseudocode | No | The paper describes methods in text and uses figures to illustrate concepts (e.g., Figure 1 for visualization of JetFormer training), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Reproducibility Statement We provide detailed information about the training recipe, the architecture, hyper-parameters and the training data in Section 4 and Appendix A. The paper does not explicitly state that source code is released or provide a link to a repository. |
| Open Datasets | Yes | For training class-conditional image generation models we use ImageNet1k (Russakovsky et al., 2015). For multimodal generation we rely on the image-text pairs from the WebLI dataset (Chen et al., 2023b). ... For text-to-image, we adopt the common MS-COCO FID-30k... from MS-COCO (Lin et al., 2014)... TextVQA (Singh et al., 2019)... |
| Dataset Splits | Yes | For training class-conditional image generation models we use ImageNet1k (Russakovsky et al., 2015). For multimodal generation we rely on the image-text pairs from the WebLI dataset (Chen et al., 2023b). In both cases, we resize the images so that the shorter side is 256 pixels while preserving the aspect ratio and extract a 256×256 central crop. ... For text-to-image, we adopt the common MS-COCO FID-30k, generating images for captions from 30k randomly sampled COCO validation images and evaluating FID against reference statistics from the full COCO validation set. |
| Hardware Specification | No | The paper discusses model architecture and training recipe but does not explicitly mention specific hardware components (e.g., GPU models, CPU types, or number of accelerators) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a SentencePiece tokenizer and the Adam optimizer but does not provide specific version numbers for these or any other software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | We use the Adam optimizer with learning rate 10⁻³, decoupled weight decay of 10⁻⁴, β2 parameter 0.95, and clip the gradient norms to 1.0. We set the batch size to 4k. We also apply dropout with probability 0.1 at the output of the self-attention and MLP blocks, which we found to improve image sample quality. ...Table 6: Hyper-parameter details for fine-tuning tasks. |
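The image preprocessing quoted under "Dataset Splits" (resize the shorter side to 256 pixels while preserving aspect ratio, then take a 256×256 central crop) can be sketched as follows. This is an illustrative reconstruction using Pillow, not the authors' code; the function name `preprocess` is our own.

```python
from PIL import Image

def preprocess(img: Image.Image, size: int = 256) -> Image.Image:
    """Resize so the shorter side is `size` pixels (preserving aspect
    ratio), then extract a central size x size crop, as described in the
    paper for both ImageNet1k and WebLI images. Sketch only."""
    w, h = img.size
    scale = size / min(w, h)
    # Resize with the shorter side pinned to `size`.
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    # Central crop.
    left = (w - size) // 2
    top = (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```

For example, a 640×480 input is resized to 341×256 and then center-cropped to 256×256.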
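The optimizer recipe quoted above (Adam with decoupled weight decay, i.e. AdamW, learning rate 10⁻³, weight decay 10⁻⁴, β2 = 0.95, gradient norm clipping at 1.0) translates to a few lines of PyTorch. This is a minimal sketch under stated assumptions: β1 is left at the 0.9 default since the paper only specifies β2, and the linear model plus squared-output loss are placeholders, not the JetFormer architecture or objective.

```python
import torch

# Placeholder model; the paper's model is a transformer, not this.
model = torch.nn.Linear(16, 16)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # learning rate 10^-3
    weight_decay=1e-4,  # decoupled weight decay 10^-4
    betas=(0.9, 0.95),  # beta2 = 0.95 as reported; beta1 assumed default
)

def train_step(batch: torch.Tensor) -> float:
    """One optimization step with the reported clipping recipe."""
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    # Clip gradient norms to 1.0, per the quoted setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

The batch size of 4k and the 0.1 dropout at self-attention/MLP outputs would live in the data loader and model definition, respectively, and are omitted here.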