MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science
Authors: Erle Zhu, Yadi Liu, Zhe Zhang, Xujun Li, Jin Zhou, Xinjie Yu, Minlie Huang, Hongning Wang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Validated using our collected college-level circuit analysis problems, MAPS significantly improves reasoning accuracy of MLLM and outperforms all existing models. The results confirm MAPS offers a promising direction for enhancing multi-modal scientific reasoning ability of MLLMs. Our code is available at https://github.com/thu-coai/MAPS. |
| Researcher Affiliation | Academia | 1The Conversational AI (CoAI) Group, 2Department of Computer Science & Technology, 3Department of Electrical Engineering, Tsinghua University, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 MAPS: Inference Phase |
| Open Source Code | Yes | Our code is available at https://github.com/thu-coai/MAPS. |
| Open Datasets | No | To evaluate the entire MAPS framework on real-world physical problems, we collected 79 high-quality circuit analysis problems from related textbooks and name it Simple Circuit Eval. Simple Circuit Eval is constructed based on exercise problems primarily collected from Chinese circuit analysis textbooks, but since current MLLMs are primarily multilingual and the linguistic type is not an influencing factor in our framework, this should not affect the evaluation of different MLLMs on this dataset. |
| Dataset Splits | Yes | ppm-syn-lprc contains 20k pairs of synthetic circuit diagrams and their simulation descriptions, divided into training, validation, and test sets in a ratio of 8:1:1. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU/CPU models or other detailed computer specifications for their own experimental setup. It mentions using CogVLM-17B and GPT-4V, which are models, but not the hardware they ran these models on for their experiments. |
| Software Dependencies | No | We use Ngspice (Nenzi & Vogt, 2011) developed by the UC Berkeley CAD Group as our simulator. |
| Experiment Setup | Yes | We list our main hyperparameters used for PPM training at Table 6. Table 6: Main Hyper-parameters of PPM Training — lora-rank: 50; max-length: 2000; batch-size: 32; train-iters: 2000; optimizer: Adam; learning-rate: 1e-5; lr-decay-style: cosine; warmup: 0.2 |
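The ppm-syn-lprc split quoted above (20k diagram/description pairs divided 8:1:1 into train/validation/test) can be reproduced mechanically. The sketch below is illustrative only: the function name `split_dataset` and the fixed seed are assumptions, not taken from the paper's released code.

```python
import random

def split_dataset(pairs, ratios=(8, 1, 1), seed=0):
    """Shuffle and split a list of items (e.g. diagram/description pairs)
    into train/val/test subsets by integer ratios, 8:1:1 by default."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    total = sum(ratios)
    n_train = len(pairs) * ratios[0] // total
    n_val = len(pairs) * ratios[1] // total
    train = [pairs[i] for i in indices[:n_train]]
    val = [pairs[i] for i in indices[n_train:n_train + n_val]]
    test = [pairs[i] for i in indices[n_train + n_val:]]
    return train, val, test

# For 20k pairs, an 8:1:1 split yields 16000 / 2000 / 2000 items.
train, val, test = split_dataset(list(range(20000)))
print(len(train), len(val), len(test))  # → 16000 2000 2000
```

Any remainder after integer division falls into the test set, so the three subsets always cover the full dataset exactly once.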