ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Authors: Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the examination of 17 LMMs on ChartMimic (Sec. 3.2), including 3 proprietary models and 14 open-weight models across parameter sizes from 2.2B to 76.0B. We observe that while several open-weight models can match the performance of proprietary models such as GPT-4o on public leaderboards (OpenCompass, 2023), a significant performance gap still persists on ChartMimic. Specifically, the best open-weight model, InternVL2-Llama3-76B, lags behind GPT-4o, with an average score gap of 20.6 on two tasks, indicating substantial room for improvement in the open-source community. |
| Researcher Affiliation | Collaboration | 1Tsinghua University, 2Tencent AI Lab |
| Pseudocode | Yes | Listing 1: An exemplary Python code for logging text information. |
| Open Source Code | Yes | Data and code are available at https://github.com/ChartMimic/ChartMimic. |
| Open Datasets | Yes | ChartMimic includes 4,800 human-curated (figure, instruction, code) triplets... Data and code are available at https://github.com/ChartMimic/ChartMimic. |
| Dataset Splits | Yes | We further divide the 4,800 examples of ChartMimic into two subsets: test and testmini set. The test set comprises 3,600 examples, while the testmini set is composed of 1,200 examples. |
| Hardware Specification | Yes | All models are inferred on A100 80G GPU. |
| Software Dependencies | Yes | A team of skilled Python annotators (master's students in computer science with 6+ years of Python and matplotlib experience) reproduce 600 prototype charts using Python 3.9.0 and matplotlib v3.8.4. |
| Experiment Setup | Yes | for open-weight models, we set the temperature τ = 0.1 to achieve optimal results, while for proprietary models, we set the temperature τ = 0 for greedy decoding. For all models, we set the maximum generation length to 4096. Additionally, we use BF16 for model inference for open-weight models. |
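The decoding settings quoted in the Experiment Setup row can be summarized in a small configuration helper. This is an illustrative sketch, not code from the ChartMimic repository; the function name and dictionary keys are assumptions, but the parameter values (temperature 0.1 for open-weight models, 0 for proprietary greedy decoding, 4096-token generation cap, BF16 inference for open-weight models) come directly from the quoted setup.

```python
def generation_config(is_open_weight: bool) -> dict:
    """Return decoding parameters per the reported setup.

    Open-weight models: temperature 0.1, BF16 inference.
    Proprietary models: temperature 0 (greedy decoding).
    All models: maximum generation length of 4096 tokens.
    """
    cfg = {
        "temperature": 0.1 if is_open_weight else 0.0,
        "max_new_tokens": 4096,
    }
    if is_open_weight:
        # BF16 is used only for local (open-weight) inference.
        cfg["dtype"] = "bfloat16"
    return cfg


# Example: settings for an open-weight vs. a proprietary model.
open_cfg = generation_config(is_open_weight=True)
api_cfg = generation_config(is_open_weight=False)
```

Keeping the two regimes in one helper makes it explicit that the only differences in the reported setup are the sampling temperature and the inference precision.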