ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Authors: Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the examination of 17 LMMs on ChartMimic (Sec. 3.2), including 3 proprietary models and 14 open-weight models across parameter sizes from 2.2B to 76.0B. We observe that while several open-weight models can match the performance of proprietary models such as GPT-4o on public leaderboards (OpenCompass, 2023), a significant performance gap still persists on ChartMimic. Specifically, the best open-weight model, InternVL2-Llama3-76B, lags behind GPT-4o, with an average score gap of 20.6 on two tasks, indicating substantial room for improvement in the open-source community. |
| Researcher Affiliation | Collaboration | 1Tsinghua University, 2Tencent AI Lab |
| Pseudocode | Yes | Listing 1: An exemplary Python code for logging text information. |
| Open Source Code | Yes | Data and code are available at https://github.com/ChartMimic/ChartMimic. |
| Open Datasets | Yes | ChartMimic includes 4,800 human-curated (figure, instruction, code) triplets... Data and code are available at https://github.com/ChartMimic/ChartMimic. |
| Dataset Splits | Yes | We further divide the 4,800 examples of ChartMimic into two subsets: test and testmini set. The test set comprises 3,600 examples, while the testmini set is composed of 1,200 examples. |
| Hardware Specification | Yes | All models are inferred on A100 80G GPU. |
| Software Dependencies | Yes | A team of skilled Python annotators (master's students in computer science with 6+ years of Python and matplotlib experience) reproduce 600 prototype charts using Python 3.9.0 and matplotlib v3.8.4. |
| Experiment Setup | Yes | for open-weight models, we set the temperature τ = 0.1 to achieve optimal results, while for proprietary models, we set the temperature τ = 0 for greedy decoding. For all models, we set the maximum generation length to 4096. Additionally, we use BF16 for model inference for open-weight models. |
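The decoding settings quoted in the Experiment Setup row can be summarized in a small configuration helper. This is an illustrative sketch, not code from the ChartMimic repository; the function name and dictionary keys are assumptions, but the parameter values (temperature 0.1 for open-weight models, 0 for proprietary greedy decoding, 4096-token generation cap, BF16 inference for open-weight models) come directly from the quoted setup.

```python
def generation_config(is_open_weight: bool) -> dict:
    """Return decoding parameters per the reported setup.

    Open-weight models: temperature 0.1, BF16 inference.
    Proprietary models: temperature 0 (greedy decoding).
    All models: maximum generation length of 4096 tokens.
    """
    cfg = {
        "temperature": 0.1 if is_open_weight else 0.0,
        "max_new_tokens": 4096,
    }
    if is_open_weight:
        # BF16 is used only for local (open-weight) inference.
        cfg["dtype"] = "bfloat16"
    return cfg


# Example: settings for an open-weight vs. a proprietary model.
open_cfg = generation_config(is_open_weight=True)
api_cfg = generation_config(is_open_weight=False)
```

Keeping the two regimes in one helper makes it explicit that the only differences in the reported setup are the sampling temperature and the inference precision.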