Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Authors: Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B models, multi-agent setups) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) the benchmark's effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization.
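To make the "non-decimal arithmetic" finding concrete, here is a minimal sketch of the kind of task such evaluations probe: addition posed in a non-decimal base. The task format and function name are illustrative assumptions, not taken from the paper.

```python
def eval_base_add(a: str, b: str, base: int) -> str:
    """Add two numbers written in the given base; return the sum in that base.

    Hypothetical example task -- the paper's exact task format may differ.
    """
    total = int(a, base) + int(b, base)  # parse both operands in the given base
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if total == 0:
        return "0"
    out = []
    while total:
        total, r = divmod(total, base)
        out.append(digits[r])
    return "".join(reversed(out))

# 25 (base 7) + 14 (base 7) = 19 + 11 = 30 in decimal = 42 in base 7
print(eval_base_add("25", "14", 7))  # → 42
```

A model that has memorized decimal facts but not the underlying algorithm tends to fail exactly this kind of item, which is what makes it diagnostic.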
Researcher Affiliation Academia 1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China; 2 Institute of Artificial Intelligence, Xiamen University. Correspondence to: Xiawu Zheng <EMAIL>.
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes The benchmark dataset, generation scripts, and evaluation code proposed in this paper will be publicly available at https://github.com/MAC-AutoML/abstract-reason-benchmark.
Open Datasets Yes Our domain-general, theoretically grounded benchmark with symbolic tasks addresses these limitations for rigorous abstract reasoning evaluation in LLMs.
Dataset Splits Yes Two distinct training data configurations were employed. The Unmapped Symbols Training (*) involved finetuning the model on 2,000 samples generated using original, unmapped symbols... Conversely, the Fully Mapped Symbols Training (**) fine-tuned the model on 2,000 samples where symbols (both operands and operators) were systematically remapped...
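The "fully mapped symbols" configuration described above can be sketched as a systematic remapping of operands and operators to fresh glyphs, so that memorized surface forms no longer help. The specific mapping below is purely illustrative; the paper does not specify which replacement symbols it uses.

```python
# Hypothetical symbol remapping: digits and operators are replaced one-for-one
# with unfamiliar glyphs, preserving the task's structure.
DIGIT_MAP = str.maketrans("0123456789", "ΦΨΩΛΣΠΔΘΞΓ")  # illustrative glyph choice
OP_MAP = {"+": "⊕", "-": "⊖", "*": "⊗"}

def remap(expr: str) -> str:
    """Apply the symbol remapping to one arithmetic expression string."""
    out = expr.translate(DIGIT_MAP)          # remap operand digits
    for op, sym in OP_MAP.items():           # remap operator symbols
        out = out.replace(op, sym)
    return out

print(remap("12+34"))  # → ΨΩ⊕ΛΣ
```

Comparing accuracy on original versus remapped expressions is what quantifies the performance degradation attributed to memorization.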
Hardware Specification Yes All experiments were conducted on local machines equipped with 8 NVIDIA GPUs, including A800 and 3090 models.
Software Dependencies No The paper mentions the use of 'AutoTokenizer' and the 'Paged AdamW 8-bit optimizer' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup Yes Key hyperparameters, based on our experimental script, included a per-device batch size of 4, 2 gradient accumulation steps, the Paged AdamW 8-bit optimizer, a learning rate of 2e-5, and a cosine learning rate scheduler with a warmup ratio of 0.03. BF16 precision was used, the seed was set to 42, and gradient checkpointing was enabled.
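The hyperparameters listed above can be collected into a single config. The field names below mirror Hugging Face TrainingArguments conventions as an assumption; the paper only reports the values, not the exact keys.

```python
# Fine-tuning hyperparameters as reported, keyed with assumed
# (Hugging Face-style) field names.
train_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,     # effective batch = 4 * 2 per device
    "optim": "paged_adamw_8bit",          # Paged AdamW 8-bit optimizer
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "bf16": True,                         # BF16 mixed precision
    "seed": 42,
    "gradient_checkpointing": True,
}
print(train_config["learning_rate"])
```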