Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Authors: Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. ... We demonstrate in a series of controlled experiments that Transfusion is a viable, scalable method for training a unified multi-modal model. The setup of our experiments is detailed in Appendix B.1. |
| Researcher Affiliation | Collaboration | Chunting Zhouµ, Lili Yuµ, Arun Babuµ, Kushal Tirumalaµ, Michihiro Yasunagaµ, Leonid Shamisµ, Jacob Kahnµ, Xuezhe Maσ, Luke Zettlemoyerµ, Omer Levyµ; µ work done at Meta, σ University of Southern California |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Procedures are described in descriptive text and mathematical formulas. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide a link to a code repository. It mentions using and comparing against other models/frameworks like Chameleon and Llama but does not state releasing code for Transfusion. |
| Open Datasets | Yes | For text-to-text, we measure perplexity on 20M held-out tokens from Wikipedia and the C4 corpus (Raffel et al., 2019), as well as accuracy on the pretraining evaluation suite of Llama 2 (Touvron et al., 2023b). For text-to-image, we use the MS-COCO benchmark (Lin et al., 2014), where we generate images on randomly selected 30k prompts from validation set and measure their photo-realism using zero-shot Frechet Inception Distance (FID) (Heusel et al., 2017) as well as their alignment with the prompts using CLIP score (Radford et al., 2021). ... We also add data from Conceptual 12M (CC12M) (Changpinyo et al., 2021), reaching a total mixture of 692M image-caption pairs per epoch. |
| Dataset Splits | Yes | For text-to-text, we measure perplexity on 20M held-out tokens from Wikipedia and the C4 corpus (Raffel et al., 2019), as well as accuracy on the pretraining evaluation suite of Llama 2 (Touvron et al., 2023b). For text-to-image, we use the MS-COCO benchmark (Lin et al., 2014), where we generate images on randomly selected 30k prompts from validation set and measure their photo-realism using zero-shot Frechet Inception Distance (FID) (Heusel et al., 2017) as well as their alignment with the prompts using CLIP score (Radford et al., 2021). ... We report CIDEr (Vedantam et al., 2015) scores on the Karpathy test split of MS-COCO (Lin et al., 2014). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU models, or cloud computing instance types. It discusses training parameters and data but omits hardware specifications. |
| Software Dependencies | No | The paper does not specify the versions of any ancillary software dependencies used for the experiments. It mentions using the Llama 2 tokenizer but no version. |
| Experiment Setup | Yes | We use AdamW (β1=0.9, β2=0.95, ϵ=1e-8) with a learning rate of 3e-4, warmed up for 4000 steps and decaying to 1.5e-5 using a cosine scheduler. We train on sequences of 4096 tokens in batches of 2M tokens for 250k steps, reaching 0.5T tokens in total. In our large-scale experiment (§4.4), we train with a batch size of 4M tokens over 500k steps, totalling 2T tokens. We set the λ coefficient in the Transfusion objective (Equation 4) to 5 following preliminary experiments. ... For image generation, we follow the standard of 250 diffusion steps (the model is trained on 1,000 timesteps). We follow Chameleon and use CFG with a coefficient of 5 in the controlled comparison experiments (§4.2). This value is suboptimal for Transfusion, and so we use a CFG coefficient of 3 throughout the ablation experiments (§4.3), and follow the standard practice of tuning the coefficient for each benchmark in our large scale experiment (§4.4). |
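The experiment-setup row quotes two concrete recipes: a linear-warmup/cosine-decay learning-rate schedule (warmup to 3e-4 over 4000 steps, decay to 1.5e-5 by 250k steps) and a combined objective weighting the diffusion loss by λ=5. A minimal sketch of both, assuming the common warmup-then-cosine formulation (function names and the exact schedule shape are illustrative, not from the authors' code):

```python
import math

# Hyperparameters quoted in the paper's setup.
PEAK_LR = 3e-4
MIN_LR = 1.5e-5
WARMUP_STEPS = 4_000
TOTAL_STEPS = 250_000
LAMBDA = 5.0  # weight on the diffusion loss in the Transfusion objective

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

def transfusion_loss(lm_loss: float, diffusion_loss: float) -> float:
    """Combined objective: next-token-prediction loss plus λ-weighted diffusion loss."""
    return lm_loss + LAMBDA * diffusion_loss
```

This mirrors the quoted numbers only; a real training loop would compute `lm_loss` and `diffusion_loss` per modality over mixed-modality sequences and feed `learning_rate(step)` to AdamW.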