ScImage: How good are multimodal large language models at scientific text-to-image generation?
Authors: Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Fahimeh Moafian, Zhixue Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ... We evaluate seven models, GPT-4o, Llama, AutomaTikZ, Dall-E, Stable Diffusion, GPT-o1 and Qwen2.5Coder-Instruct using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy)... |
| Researcher Affiliation | Academia | (1) University of Twente; (2) University of Technology Nuremberg; (3) University of Sheffield; (4) Harbin Institute of Technology; (5) University of Mannheim; (6) Technische Universität Dresden |
| Pseudocode | No | The paper describes its methodology in prose and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | ScImage Code: https://github.com/leixin-zhang/Scimage |
| Open Datasets | Yes | ScImage prompts: https://huggingface.co/datasets/casszhao/ScImage |
| Dataset Splits | Yes | In total, we have 101 query templates and 404 generation queries. Examples of ScImage are shown in Table 9. ... For English, we evaluate 404 prompts for 7 different models... For the later multilingual phase, we evaluate 540 images... |
| Hardware Specification | No | The paper does not explicitly state specific hardware details (like GPU/CPU models or memory) used for running its experiments or evaluations. |
| Software Dependencies | No | The paper mentions software like Python, TikZ, and Matplotlib, but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We instruct the models to generate scientific graphs with prompts. Each prompt consists of an auxiliary instruction and a generation query. The auxiliary instruction is used to constrain the model to generate scientific graphs in either (i) direct text-image or (ii) text-code-image mode. ... Our resulting auxiliary instructions are shown in Table 8 in Appendix B. |
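
The prompt structure described under Experiment Setup (an auxiliary instruction followed by a generation query, in one of two output modes) can be sketched roughly as below. This is a minimal illustration, not the authors' code: the instruction wording, the `Prompt` class, the `build_prompt` function, and the example query are all hypothetical. The paper's actual auxiliary instructions are given in its Table 8 and are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical placeholder wording. The paper's real auxiliary
# instructions appear in its Table 8 (Appendix B), not here.
AUXILIARY_INSTRUCTIONS = {
    # (i) direct text-image mode: the model returns a raster image.
    "text-image": "Generate a scientific graph as an image for this description.",
    # (ii) text-code-image mode: the model returns code that is then rendered.
    "text-code-image": "Write {language} code that draws a scientific graph for this description.",
}


@dataclass(frozen=True)
class Prompt:
    """A full prompt: auxiliary instruction plus generation query."""
    mode: str
    text: str


def build_prompt(query: str, mode: str, code_language: str = "Python") -> Prompt:
    """Prepend the mode-specific auxiliary instruction to a generation query."""
    if mode not in AUXILIARY_INSTRUCTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    # str.format ignores the unused keyword for templates without a placeholder.
    instruction = AUXILIARY_INSTRUCTIONS[mode].format(language=code_language)
    return Prompt(mode=mode, text=f"{instruction}\n{query}")


if __name__ == "__main__":
    # Hypothetical generation query, for illustration only.
    query = "A bar chart with three bars of heights 1, 2 and 3."
    for mode, language in [("text-image", "Python"), ("text-code-image", "TikZ")]:
        print(build_prompt(query, mode, language).text)
        print("---")
```

Keeping the auxiliary instruction separate from the query means the same generation query can be reused across both output modes and across code languages such as Python and TikZ.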