Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

TMLR 2022

Reproducibility Variable Result LLM Response
Research Type: Experimental. The abstract mentions "state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as Parti Prompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti". Section 5 is titled "Experiments" and discusses "automatic evaluations on both MS-COCO and Localized Narratives" and "human side-by-side evaluations".
Researcher Affiliation: Industry. All authors are listed with the affiliation "Google Research" and email addresses ending with "@google.com" (e.g., EMAIL).
Pseudocode: No. The paper does not contain any clearly labeled pseudocode or algorithm blocks. Figure 3 presents a system overview, but it is a diagram, not pseudocode.
Open Source Code: No. Section 8, "Broader Impacts", states: "These considerations all contribute to our decision not to release our models, code or data at this time."
Open Datasets: Yes. Section 4.1 "Training Datasets" lists publicly available datasets: "LAION-400M dataset (Schuhmann et al., 2021); FIT400M... used to train the ALIGN model (Jia et al., 2021a); JFT-4B dataset (Zhai et al., 2022)". Section 4.2 "Evaluation Datasets" mentions "MS-COCO (2014) (Lin et al., 2014) and Localized Narratives (Pont-Tuset et al., 2020)". Additionally, Section 9 "Conclusion" states, "To this end, the Parti Prompts (P2) benchmark that we release with this work are intentionally crafted to induce many of these error types."
Dataset Splits: Yes. Section 4.2 "Evaluation Datasets" states: "MS-COCO (2014) (Lin et al., 2014) 82K Train, 40K Val" and "Localized Narratives (COCO subset) (Pont-Tuset et al., 2020) 134K Train, 8K Val". It further specifies, "We use 30,000 generated and real image samples for evaluation on MS-COCO (2014)" and "The validation set of the Localized Narratives COCO split contains only 5,000 unique images, so we follow (Zhang et al., 2021) in oversampling the captions to acquire 30,000 generated images."
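The oversampling step described above (drawing 30,000 evaluation prompts from only 5,000 unique images) can be sketched as follows. The exact sampling procedure is not specified in the paper, so this helper, its name, and the sampling-with-replacement choice are all illustrative assumptions:

```python
import random

def oversample_captions(captions_per_image, target=30_000, seed=0):
    """Hypothetical sketch: flatten the per-image caption lists and
    draw `target` prompts with replacement, so a small validation set
    (e.g. 5,000 unique images) can still yield 30,000 prompts.
    The paper's actual procedure (Zhang et al., 2021) may differ."""
    pool = [cap for caps in captions_per_image for cap in caps]
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    return [rng.choice(pool) for _ in range(target)]
```

With one caption per image for 5,000 images, this returns a list of exactly 30,000 prompts, each repeated roughly six times on average.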
Hardware Specification: Yes. Section 3 "Scaling and Parallelization" states: "We implement our models in Lingvo (Shen et al., 2019) and scale with GSPMD (Xu et al., 2021) on Cloud TPUv4 hardware for both training and inference."
Software Dependencies: No. The paper mentions several software components, such as "Lingvo (Shen et al., 2019)", "GSPMD (Xu et al., 2021)", the "Adafactor (Shazeer & Stern, 2018) optimizer", the "BERT (Devlin et al., 2019) pretraining objective", and an "XLA compiler-based model partitioning system". However, specific version numbers for these software dependencies are not provided in the text.
Experiment Setup: Yes. Section 2.1 details the ViT-VQGAN tokenizer configuration: "8 blocks, 8 heads, model dimension 512, and hidden dimension 2048" with "about 30M total parameters", and a larger finetuned decoder with "32 blocks, 16 heads, model dimension 1280, and hidden dimension 5120, with about 600M total parameters". Section 2.2 specifies a "maximum length of text tokens of 128", with the length of image tokens fixed to 1024. Table 1 provides parameters for four model sizes (350M, 750M, 3B, 20B). Section 3 details training: the "Adafactor (Shazeer & Stern, 2018) optimizer is used to save memory" with β1 = 0.9, β2 = 0.96, and a decoupled weight decay value of 4.5 × 10⁻²; "default dropout ratio 0.1"; "Data types are cast to bfloat16 for attention projection and feed-forward transformer layers, while all layer norms and model output are kept as float32. We use a default learning rate of 4.5e-5 and exponential learning rate schedule with 5,000 warm-up steps. Exponential decaying starts at training steps 85,000 with a total of 450,000 steps and final ratio of 0.025. We use a global batch size of 8192 during training. We additionally clip gradient norm to a value of 4.0."
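The learning-rate schedule reported above (base rate 4.5e-5, 5,000 warm-up steps, exponential decay from step 85,000 to step 450,000 with final ratio 0.025) can be sketched as a standalone function. The paper does not specify the warm-up shape or the decay interpolation, so the linear warm-up and the exponential interpolation between the stated endpoints are assumptions:

```python
def parti_lr_schedule(step, base_lr=4.5e-5, warmup_steps=5_000,
                      decay_start=85_000, total_steps=450_000,
                      final_ratio=0.025):
    """Illustrative sketch of the schedule described in Section 3.
    Only the endpoints are given in the paper; the warm-up shape and
    the interpolation between decay_start and total_steps are assumed."""
    if step < warmup_steps:
        # assumed linear warm-up from 0 to base_lr
        return base_lr * step / warmup_steps
    if step < decay_start:
        # constant plateau until exponential decay begins
        return base_lr
    # exponential interpolation from base_lr down to final_ratio * base_lr
    frac = (step - decay_start) / (total_steps - decay_start)
    return base_lr * final_ratio ** frac
```

At step 450,000 this yields 4.5e-5 × 0.025 = 1.125e-6, matching the stated final ratio.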