JetFormer: An autoregressive generative model of raw images and text
Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on ImageNet class-conditional image generation and on web-scale multimodal generation, thereby demonstrating that JetFormer works and scales to both text-to-image generation and vision-language understanding with a single model. ...Table 1: Comparison of JetFormer trained for 500 epochs on ImageNet256 with baselines from the literature. ...Table 2: Summary of the main JetFormer (trained for 100 epochs) ablations performed on class-conditional ImageNet256 generation. |
| Researcher Affiliation | Industry | Michael Tschannen, André Susano Pinto, Alexander Kolesnikov, Google DeepMind |
| Pseudocode | No | The paper describes methods in text and uses figures to illustrate concepts (e.g., Figure 1 for visualization of JetFormer training), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Reproducibility Statement We provide detailed information about the training recipe, the architecture, hyper-parameters and the training data in Section 4 and Appendix A. The paper does not explicitly state that source code is released or provide a link to a repository. |
| Open Datasets | Yes | For training class-conditional image generation models we use ImageNet1k (Russakovsky et al., 2015). For multimodal generation we rely on the image-text pairs from the WebLI dataset (Chen et al., 2023b). ... For text-to-image, we adopt the common MS-COCO FID-30k... from MS-COCO (Lin et al., 2014)... TextVQA (Singh et al., 2019)... |
| Dataset Splits | Yes | For training class-conditional image generation models we use ImageNet1k (Russakovsky et al., 2015). For multimodal generation we rely on the image-text pairs from the WebLI dataset (Chen et al., 2023b). In both cases, we resize the images so that the shorter side is 256 pixels while preserving the aspect ratio and extract a 256×256 central crop. ... For text-to-image, we adopt the common MS-COCO FID-30k, generating images for captions from 30k randomly sampled COCO validation images and evaluating FID against reference statistics from the full COCO validation set. |
| Hardware Specification | No | The paper discusses model architecture and training recipe but does not explicitly mention specific hardware components (e.g., GPU models, CPU types, or number of accelerators) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a SentencePiece tokenizer and the Adam optimizer but does not provide specific version numbers for these or any other software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | We use the Adam optimizer with learning rate 10⁻³, decoupled weight decay of 10⁻⁴, β2 parameter 0.95, and clip the gradient norms to 1.0. We set the batch size to 4k. We also apply dropout with probability 0.1 at the output of the self-attention and MLP blocks, which we found to improve image sample quality. ...Table 6: Hyper-parameter details for fine-tuning tasks. |
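The image preprocessing quoted under "Dataset Splits" (resize the shorter side to 256 pixels while preserving aspect ratio, then take a 256×256 central crop) can be sketched as follows. This is an illustrative reconstruction using Pillow, not the authors' code; the function name `preprocess` is our own.

```python
from PIL import Image

def preprocess(img: Image.Image, size: int = 256) -> Image.Image:
    """Resize so the shorter side is `size` pixels (preserving aspect
    ratio), then extract a central size x size crop, as described in the
    paper for both ImageNet1k and WebLI images. Sketch only."""
    w, h = img.size
    scale = size / min(w, h)
    # Resize with the shorter side pinned to `size`.
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    # Central crop.
    left = (w - size) // 2
    top = (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```

For example, a 640×480 input is resized to 341×256 and then center-cropped to 256×256.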
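The optimizer recipe quoted above (Adam with decoupled weight decay, i.e. AdamW, learning rate 10⁻³, weight decay 10⁻⁴, β2 = 0.95, gradient norm clipping at 1.0) translates to a few lines of PyTorch. This is a minimal sketch under stated assumptions: β1 is left at the 0.9 default since the paper only specifies β2, and the linear model plus squared-output loss are placeholders, not the JetFormer architecture or objective.

```python
import torch

# Placeholder model; the paper's model is a transformer, not this.
model = torch.nn.Linear(16, 16)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # learning rate 10^-3
    weight_decay=1e-4,  # decoupled weight decay 10^-4
    betas=(0.9, 0.95),  # beta2 = 0.95 as reported; beta1 assumed default
)

def train_step(batch: torch.Tensor) -> float:
    """One optimization step with the reported clipping recipe."""
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    # Clip gradient norms to 1.0, per the quoted setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

The batch size of 4k and the 0.1 dropout at self-attention/MLP outputs would live in the data loader and model definition, respectively, and are omitted here.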