reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ScImage: How good are multimodal large language models at scientific text-to-image generation?

Authors: Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Fahimeh Moafian, Zhixue Zhao

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ... We evaluate seven models, GPT-4o, Llama, AutomaTikZ, Dall-E, Stable Diffusion, GPT-o1 and Qwen2.5Coder-Instruct using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy)...
Researcher Affiliation	Academia	(1) University of Twente; (2) University of Technology Nuremberg; (3) University of Sheffield; (4) Harbin Institute of Technology; (5) University of Mannheim; (6) Technische Universität Dresden
Pseudocode	No	The paper describes its methodology in prose and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	ScImage Code: https://github.com/leixin-zhang/Scimage
Open Datasets	Yes	ScImage prompts: https://huggingface.co/datasets/casszhao/ScImage
Dataset Splits	Yes	In total, we have 101 query templates and 404 generation queries. Examples of ScImage are shown in Table 9. ... For English, we evaluate 404 prompts for 7 different models... For the later multilingual phase, we evaluate 540 images...
Hardware Specification	No	The paper does not state the hardware (such as GPU/CPU models or memory) used to run its experiments or evaluations.
Software Dependencies	No	The paper mentions software like Python, TikZ, and Matplotlib, but does not provide specific version numbers for any of these software dependencies.
Experiment Setup	Yes	We instruct the models to generate scientific graphs with prompts. Each prompt consists of an auxiliary instruction and a generation query. The auxiliary instruction is used to constrain the model to generate scientific graphs in either (i) direct text-image or (ii) text-code-image mode. ... Our resulting auxiliary instructions are shown in Table 8 in Appendix B.
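
The prompt structure quoted in the Experiment Setup row (an auxiliary instruction plus a generation query, in one of two output modes) can be illustrated with a minimal Python sketch. This is not the authors' code: every name here (`OutputMode`, `Prompt`, `build_prompt`, `AUXILIARY_INSTRUCTIONS`) is hypothetical, the instruction strings are placeholders rather than the wording in Table 8 of the paper, and joining the two parts with a newline is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class OutputMode(Enum):
    """The two generation modes named in the quoted setup."""
    DIRECT_IMAGE = "text-image"      # model returns a raster image directly
    CODE_IMAGE = "text-code-image"   # model returns code (e.g. Python or TikZ) that is rendered


# Placeholder wording only (assumption); the real auxiliary
# instructions are listed in Table 8, Appendix B of the paper.
AUXILIARY_INSTRUCTIONS = {
    OutputMode.DIRECT_IMAGE: "Generate a scientific graph as an image.",
    OutputMode.CODE_IMAGE: "Write code that renders a scientific graph.",
}


@dataclass(frozen=True)
class Prompt:
    mode: OutputMode
    auxiliary_instruction: str
    generation_query: str

    def render(self) -> str:
        # Assumed layout: instruction first, query on the next line.
        return f"{self.auxiliary_instruction}\n{self.generation_query}"


def build_prompt(mode: OutputMode, generation_query: str) -> Prompt:
    """Pair a generation query with the auxiliary instruction for its mode."""
    return Prompt(mode, AUXILIARY_INSTRUCTIONS[mode], generation_query.strip())


if __name__ == "__main__":
    # Hypothetical query, not taken from the benchmark.
    prompt = build_prompt(OutputMode.CODE_IMAGE, "A bar chart with three bars.")
    print(prompt.render())
```

Keeping the mode as an explicit field makes it straightforward to group model outputs by generation mode later, which is how the quoted evaluation distinguishes code-based from direct image generation.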