Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Authors: Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. ... We demonstrate in a series of controlled experiments that Transfusion is a viable, scalable method for training a unified multi-modal model. The setup of our experiments is detailed in Appendix B.1. |
| Researcher Affiliation | Collaboration | Chunting Zhouµ, Lili Yuµ, Arun Babuµ, Kushal Tirumalaµ, Michihiro Yasunagaµ, Leonid Shamisµ, Jacob Kahnµ, Xuezhe Maσ, Luke Zettlemoyerµ, Omer Levyµ; µ work done at Meta, σ University of Southern California |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Procedures are described in descriptive text and mathematical formulas. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide a link to a code repository. It mentions using and comparing against other models/frameworks like Chameleon and Llama but does not state releasing code for Transfusion. |
| Open Datasets | Yes | For text-to-text, we measure perplexity on 20M held-out tokens from Wikipedia and the C4 corpus (Raffel et al., 2019), as well as accuracy on the pretraining evaluation suite of Llama 2 (Touvron et al., 2023b). For text-to-image, we use the MS-COCO benchmark (Lin et al., 2014), where we generate images on randomly selected 30k prompts from validation set and measure their photo-realism using zero-shot Frechet Inception Distance (FID) (Heusel et al., 2017) as well as their alignment with the prompts using CLIP score (Radford et al., 2021). ... We also add data from Conceptual 12M (CC12M) (Changpinyo et al., 2021), reaching a total mixture of 692M image-caption pairs per epoch. |
| Dataset Splits | Yes | For text-to-text, we measure perplexity on 20M held-out tokens from Wikipedia and the C4 corpus (Raffel et al., 2019), as well as accuracy on the pretraining evaluation suite of Llama 2 (Touvron et al., 2023b). For text-to-image, we use the MS-COCO benchmark (Lin et al., 2014), where we generate images on randomly selected 30k prompts from validation set and measure their photo-realism using zero-shot Frechet Inception Distance (FID) (Heusel et al., 2017) as well as their alignment with the prompts using CLIP score (Radford et al., 2021). ... We report CIDEr (Vedantam et al., 2015) scores on the Karpathy test split of MS-COCO (Lin et al., 2014). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU models, or cloud computing instance types. It discusses training parameters and data but omits hardware specifications. |
| Software Dependencies | No | The paper does not specify the versions of any ancillary software dependencies used for the experiments. It mentions using the Llama 2 tokenizer but no version. |
| Experiment Setup | Yes | We use AdamW (β1=0.9, β2=0.95, ϵ=1e-8) with a learning rate of 3e-4, warmed up for 4000 steps and decaying to 1.5e-5 using a cosine scheduler. We train on sequences of 4096 tokens in batches of 2M tokens for 250k steps, reaching 0.5T tokens in total. In our large-scale experiment (§4.4), we train with a batch size of 4M tokens over 500k steps, totalling 2T tokens. We set the λ coefficient in the Transfusion objective (Equation 4) to 5 following preliminary experiments. ... For image generation, we follow the standard of 250 diffusion steps (the model is trained on 1,000 timesteps). We follow Chameleon and use CFG with a coefficient of 5 in the controlled comparison experiments (§4.2). This value is suboptimal for Transfusion, and so we use a CFG coefficient of 3 throughout the ablation experiments (§4.3), and follow the standard practice of tuning the coefficient for each benchmark in our large scale experiment (§4.4). |
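The experiment-setup row quotes two concrete recipes: a linear-warmup/cosine-decay learning-rate schedule (warmup to 3e-4 over 4000 steps, decay to 1.5e-5 by 250k steps) and a combined objective weighting the diffusion loss by λ=5. A minimal sketch of both, assuming the common warmup-then-cosine formulation (function names and the exact schedule shape are illustrative, not from the authors' code):

```python
import math

# Hyperparameters quoted in the paper's setup.
PEAK_LR = 3e-4
MIN_LR = 1.5e-5
WARMUP_STEPS = 4_000
TOTAL_STEPS = 250_000
LAMBDA = 5.0  # weight on the diffusion loss in the Transfusion objective

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

def transfusion_loss(lm_loss: float, diffusion_loss: float) -> float:
    """Combined objective: next-token-prediction loss plus λ-weighted diffusion loss."""
    return lm_loss + LAMBDA * diffusion_loss
```

This mirrors the quoted numbers only; a real training loop would compute `lm_loss` and `diffusion_loss` per modality over mixed-modality sequences and feed `learning_rate(step)` to AdamW.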