Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Authors: Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B models, multi-agent setups) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) the benchmark's effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization.
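To make the "non-decimal arithmetic" finding concrete, here is a minimal sketch of the kind of task such evaluations probe: addition posed in a non-decimal base. The task format and function name are illustrative assumptions, not taken from the paper.

```python
def eval_base_add(a: str, b: str, base: int) -> str:
    """Add two numbers written in the given base; return the sum in that base.

    Hypothetical example task -- the paper's exact task format may differ.
    """
    total = int(a, base) + int(b, base)  # parse both operands in the given base
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if total == 0:
        return "0"
    out = []
    while total:
        total, r = divmod(total, base)
        out.append(digits[r])
    return "".join(reversed(out))

# 25 (base 7) + 14 (base 7) = 19 + 11 = 30 in decimal = 42 in base 7
print(eval_base_add("25", "14", 7))  # → 42
```

A model that has memorized decimal facts but not the underlying algorithm tends to fail exactly this kind of item, which is what makes it diagnostic.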
Researcher Affiliation Academia 1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China; 2 Institute of Artificial Intelligence, Xiamen University. Correspondence to: Xiawu Zheng <EMAIL>.
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes The benchmark dataset, generation scripts, and evaluation code proposed in this paper will be publicly available at https://github.com/MAC-AutoML/abstract-reason-benchmark.
Open Datasets Yes Our domain-general, theoretically grounded benchmark with symbolic tasks addresses these limitations for rigorous abstract reasoning evaluation in LLMs.
Dataset Splits Yes Two distinct training data configurations were employed. The Unmapped Symbols Training (*) involved finetuning the model on 2,000 samples generated using original, unmapped symbols... Conversely, the Fully Mapped Symbols Training (**) fine-tuned the model on 2,000 samples where symbols (both operands and operators) were systematically remapped...
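The "fully mapped symbols" configuration described above can be sketched as a systematic remapping of operands and operators to fresh glyphs, so that memorized surface forms no longer help. The specific mapping below is purely illustrative; the paper does not specify which replacement symbols it uses.

```python
# Hypothetical symbol remapping: digits and operators are replaced one-for-one
# with unfamiliar glyphs, preserving the task's structure.
DIGIT_MAP = str.maketrans("0123456789", "ΦΨΩΛΣΠΔΘΞΓ")  # illustrative glyph choice
OP_MAP = {"+": "⊕", "-": "⊖", "*": "⊗"}

def remap(expr: str) -> str:
    """Apply the symbol remapping to one arithmetic expression string."""
    out = expr.translate(DIGIT_MAP)          # remap operand digits
    for op, sym in OP_MAP.items():           # remap operator symbols
        out = out.replace(op, sym)
    return out

print(remap("12+34"))  # → ΨΩ⊕ΛΣ
```

Comparing accuracy on original versus remapped expressions is what quantifies the performance degradation attributed to memorization.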
Hardware Specification Yes All experiments were conducted on local machines equipped with 8 NVIDIA GPUs, including A800 and 3090 models.
Software Dependencies No The paper mentions the use of 'AutoTokenizer' and the 'Paged AdamW 8-bit optimizer' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup Yes Key hyperparameters, based on our experimental script, included a per-device batch size of 4, 2 gradient accumulation steps, the Paged AdamW 8-bit optimizer, a learning rate of 2e-5, and a cosine learning rate scheduler with a warmup ratio of 0.03. BF16 precision was used, the seed was set to 42, and gradient checkpointing was enabled.
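The hyperparameters listed above can be collected into a single config. The field names below mirror Hugging Face TrainingArguments conventions as an assumption; the paper only reports the values, not the exact keys.

```python
# Fine-tuning hyperparameters as reported, keyed with assumed
# (Hugging Face-style) field names.
train_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,     # effective batch = 4 * 2 per device
    "optim": "paged_adamw_8bit",          # Paged AdamW 8-bit optimizer
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "bf16": True,                         # BF16 mixed precision
    "seed": 42,
    "gradient_checkpointing": True,
}
print(train_config["learning_rate"])
```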