Elucidating the design space of language models for image generation
Authors: Xuantong Liu, Shaozhe Hao, Xianbiao Qi, Tianyang Hu, Jun Wang, Rong Xiao, Yuan Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experiments, our proposed model, ELM, achieves an FID of 1.54 on 256×256 ImageNet and an FID of 3.29 on 512×512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks. |
| Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 The University of Hong Kong, 3 Intellifusion, 4 National University of Singapore. |
| Pseudocode | No | The paper describes methods like VQGAN, BAE, AR, and MLM using mathematical formulations (e.g., equations 2, 3, 4) and textual descriptions, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is being released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | With extensive experiments, our proposed model, ELM, achieves an FID of 1.54 on 256×256 ImageNet and an FID of 3.29 on 512×512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks. ... Specifically, we select CelebA (Liu et al., 2015)... and the Describable Texture Dataset (DTD) (Cimpoi et al., 2014)... |
| Dataset Splits | No | The paper mentions using ImageNet, CelebA, and the Describable Texture Dataset (DTD) but does not specify the training, validation, or test splits used for these datasets, only the number of generated samples used for evaluation (e.g., 50,000 or 30,000). |
| Hardware Specification | Yes | All language models were trained on 80GB A800 GPUs with a batch size of 256, for 400 epochs, using a constant learning rate of 1e-4, weight decay of 0.05, and the AdamW optimizer with β1 = 0.9 and β2 = 0.95. The L- and XL-sized models were trained on 8 A800 GPUs, requiring approximately 6.4 and 10 days, respectively, to complete 400 epochs. The XXL-sized model, trained on 16 A800 GPUs (2 nodes with 8 GPUs each), took around 12 days to finish training. |
| Software Dependencies | No | The paper mentions using the LLaMA-2 architecture and the AdamW optimizer, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | All language models were trained on 80GB A800 GPUs with a batch size of 256, for 400 epochs, using a constant learning rate of 1e-4, weight decay of 0.05, and the AdamW optimizer with β1 = 0.9 and β2 = 0.95. |
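The optimizer settings reported in the table (learning rate 1e-4, weight decay 0.05, AdamW with β1 = 0.9 and β2 = 0.95) can be sketched as a single AdamW update step in pure Python. This is an illustration of the standard decoupled-weight-decay AdamW rule with the paper's reported hyperparameters as defaults, not the authors' implementation; the function name and scalar-parameter setup are for demonstration only.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.95,
               weight_decay=0.05, eps=1e-8):
    """One AdamW update for a single scalar parameter theta at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter, not folded into the gradient
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# One step from theta = 1.0 with a unit gradient nudges the parameter
# down by roughly lr * (1 + weight_decay * theta).
theta, m, v = adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

The decoupled form matters here: with weight decay folded into the gradient (plain Adam with L2), the adaptive denominator would rescale the decay term, whereas AdamW applies it uniformly.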