ScImage: How good are multimodal large language models at scientific text-to-image generation?

Authors: Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Fahimeh Moafian, Zhixue Zhao

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ... We evaluate seven models, GPT-4o, Llama, AutomaTikZ, Dall-E, Stable Diffusion, GPT-o1 and Qwen2.5-Coder-Instruct using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy)... |
| Researcher Affiliation | Academia | 1 University of Twente; 2 University of Technology Nuremberg; 3 University of Sheffield; 4 Harbin Institute of Technology; 5 University of Mannheim; 6 Technische Universität Dresden |
| Pseudocode | No | The paper describes its methodology in prose and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | ScImage Code: https://github.com/leixin-zhang/Scimage |
| Open Datasets | Yes | ScImage prompts: https://huggingface.co/datasets/casszhao/ScImage |
| Dataset Splits | Yes | In total, we have 101 query templates and 404 generation queries. Examples of ScImage are shown in Table 9. ... For English, we evaluate 404 prompts for 7 different models... For the later multilingual phase, we evaluate 540 images... |
| Hardware Specification | No | The paper does not explicitly state specific hardware details (like GPU/CPU models or memory) used for running its experiments or evaluations. |
| Software Dependencies | No | The paper mentions software like Python, TikZ, and Matplotlib, but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We instruct the models to generate scientific graphs with prompts. Each prompt consists of an auxiliary instruction and a generation query. The auxiliary instruction is used to constrain the model to generate scientific graphs in either (i) direct text-image or (ii) text-code-image mode. ... Our resulting auxiliary instructions are shown in Table 8 in Appendix B. |
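
The prompt structure described under Experiment Setup (an auxiliary instruction followed by a generation query, in one of two modes) could be sketched roughly as follows. This is only an illustration: the instruction strings, the `build_prompt` helper, the `Prompt` class, and the example query are all hypothetical placeholders, and the wording the authors actually used is the one given in Table 8 of the paper's Appendix B.

```python
from dataclasses import dataclass

# Hypothetical placeholder instructions (assumption): the paper's real
# auxiliary instructions are listed in its Table 8 and are not reproduced here.
AUXILIARY_INSTRUCTIONS = {
    "text-image": (
        "Generate a scientific graph as an image for the following description."
    ),
    "text-code-image": (
        "Write {code_language} code that draws a scientific graph "
        "for the following description."
    ),
}


@dataclass(frozen=True)
class Prompt:
    """A full prompt: auxiliary instruction plus generation query."""
    mode: str
    text: str


def build_prompt(query: str, mode: str, code_language: str = "Python") -> Prompt:
    """Combine the auxiliary instruction for `mode` with a generation query."""
    if mode not in AUXILIARY_INSTRUCTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    # str.format ignores unused keyword arguments, so the text-image
    # template (which has no placeholder) passes through unchanged.
    instruction = AUXILIARY_INSTRUCTIONS[mode].format(code_language=code_language)
    return Prompt(mode=mode, text=f"{instruction}\n\n{query}")


if __name__ == "__main__":
    # Hypothetical example query, not taken from the benchmark.
    query = "A red circle to the left of a blue square."
    print(build_prompt(query, "text-image").text)
    print(build_prompt(query, "text-code-image", code_language="TikZ").text)
```

Keeping the auxiliary instruction separate from the query means the same generation query can be reused across both output modes, and across Python and TikZ in the code-based mode.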