Elucidating the design space of language models for image generation
Authors: Xuantong Liu, Shaozhe Hao, Xianbiao Qi, Tianyang Hu, Jun Wang, Rong Xiao, Yuan Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experiments, our proposed model, ELM, achieves an FID of 1.54 on 256×256 ImageNet and an FID of 3.29 on 512×512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks. |
| Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 The University of Hong Kong, 3 Intellifusion, 4 National University of Singapore. |
| Pseudocode | No | The paper describes methods like VQGAN, BAE, AR, and MLM using mathematical formulations (e.g., equations 2, 3, 4) and textual descriptions, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is being released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | With extensive experiments, our proposed model, ELM, achieves an FID of 1.54 on 256×256 ImageNet and an FID of 3.29 on 512×512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks. ... Specifically, we select CelebA (Liu et al., 2015)... and the Describable Texture Dataset (DTD) (Cimpoi et al., 2014)... |
| Dataset Splits | No | The paper mentions using ImageNet, CelebA, and the Describable Texture Dataset (DTD) but does not specify the training, validation, or test splits used for these datasets, only the number of generated samples used for evaluation (e.g., 50,000 or 30,000). |
| Hardware Specification | Yes | All language models were trained on 80GB A800 GPUs with a batch size of 256, for 400 epochs, using a constant learning rate of 1e-4, weight decay of 0.05, and the AdamW optimizer with β1 = 0.9 and β2 = 0.95. The L- and XL-sized models were trained on 8 A800 GPUs, requiring approximately 6.4 and 10 days, respectively, to complete 400 epochs. The XXL-sized model, trained on 16 A800 GPUs (2 nodes with 8 GPUs each), took around 12 days to finish training. |
| Software Dependencies | No | The paper mentions using the LLaMA-2 architecture and the AdamW optimizer, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | All language models were trained on 80GB A800 GPUs with a batch size of 256, for 400 epochs, using a constant learning rate of 1e-4, weight decay of 0.05, and the AdamW optimizer with β1 = 0.9 and β2 = 0.95. |
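The optimizer settings reported in the table (learning rate 1e-4, weight decay 0.05, AdamW with β1 = 0.9 and β2 = 0.95) can be sketched as a single AdamW update step in pure Python. This is an illustration of the standard decoupled-weight-decay AdamW rule with the paper's reported hyperparameters as defaults, not the authors' implementation; the function name and scalar-parameter setup are for demonstration only.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.95,
               weight_decay=0.05, eps=1e-8):
    """One AdamW update for a single scalar parameter theta at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter, not folded into the gradient
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# One step from theta = 1.0 with a unit gradient nudges the parameter
# down by roughly lr * (1 + weight_decay * theta).
theta, m, v = adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

The decoupled form matters here: with weight decay folded into the gradient (plain Adam with L2), the adaptive denominator would rescale the decay term, whereas AdamW applies it uniformly.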