ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Authors: Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang

ICLR 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We conduct the examination of 17 LMMs on ChartMimic (Sec. 3.2), including 3 proprietary models and 14 open-weight models across parameter sizes from 2.2B to 76.0B. We observe that while several open-weight models can match the performance of proprietary models such as GPT-4o on public leaderboards (OpenCompass, 2023), a significant performance gap still persists on ChartMimic. Specifically, the best open-weight model, InternVL2-Llama3-76B, lags behind GPT-4o, with an average score gap of 20.6 on two tasks, indicating substantial room for improvement in the open-source community.
Researcher Affiliation | Collaboration | 1Tsinghua University, 2Tencent AI Lab
Pseudocode | Yes | Listing 1: An exemplary Python code for logging text information.
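The paper's Listing 1 is not reproduced in this report. A minimal sketch of what Python text-information logging in that spirit might look like (the logger name, format, and message below are illustrative assumptions, not the paper's actual listing):

```python
import io
import logging

# Log text information to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("chartmimic_demo")  # illustrative logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep output confined to our handler

logger.info("chart generation finished")
print(buffer.getvalue().strip())  # prints: INFO chart generation finished
```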
Open Source Code | Yes | Data and code are available at https://github.com/ChartMimic/ChartMimic.
Open Datasets | Yes | ChartMimic includes 4,800 human-curated (figure, instruction, code) triplets... Data and code are available at https://github.com/ChartMimic/ChartMimic.
Dataset Splits | Yes | We further divide the 4,800 examples of ChartMimic into two subsets: test and testmini set. The test set comprises 3,600 examples, while the testmini set is composed of 1,200 examples.
Hardware Specification | Yes | All models are inferred on an A100 80G GPU.
Software Dependencies | Yes | A team of skilled Python annotators (master's students in computer science with 6+ years of Python and matplotlib experience) reproduce 600 prototype charts using Python 3.9.0 and matplotlib v3.8.4.
Experiment Setup | Yes | For open-weight models, we set the temperature τ = 0.1 to achieve optimal results, while for proprietary models, we set the temperature τ = 0 for greedy decoding. For all models, we set the maximum generation length to 4096. Additionally, we use BF16 inference for open-weight models.
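The decoding settings quoted above can be summarized in a small configuration sketch; the function name and dictionary keys below are illustrative assumptions, not the paper's actual evaluation code:

```python
def decoding_config(open_weight: bool) -> dict:
    """Sketch of the paper's reported decoding setup.

    Open-weight models: temperature tau = 0.1, BF16 inference.
    Proprietary models: temperature tau = 0 (greedy decoding).
    All models: maximum generation length 4096.
    """
    return {
        "temperature": 0.1 if open_weight else 0.0,
        "max_new_tokens": 4096,
        "dtype": "bfloat16" if open_weight else None,  # BF16 only for open-weight
    }

print(decoding_config(True))
print(decoding_config(False))
```

Greedy decoding (τ = 0) makes proprietary-model outputs as deterministic as the APIs allow, while the small nonzero temperature for open-weight models reflects the authors' reported tuning for best results.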