Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
Authors: Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) the benchmark's effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. |
| Researcher Affiliation | Academia | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. 2Institute of Artificial Intelligence, Xiamen University. Correspondence to: Xiawu Zheng <EMAIL>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The benchmark dataset, generation scripts, and evaluation code proposed in this paper will be publicly available at https://github.com/MAC-AutoML/abstract-reason-benchmark. |
| Open Datasets | Yes | Our domain-general, theoretically grounded benchmark with symbolic tasks addresses these limitations for rigorous abstract reasoning evaluation in LLMs. |
| Dataset Splits | Yes | Two distinct training data configurations were employed. The Unmapped Symbols Training (*) involved finetuning the model on 2,000 samples generated using original, unmapped symbols... Conversely, the Fully Mapped Symbols Training (**) fine-tuned the model on 2,000 samples where symbols (both operands and operators) were systematically remapped... |
| Hardware Specification | Yes | All experiments were conducted on local machines equipped with 8 NVIDIA GPUs, including A800 and 3090 models. |
| Software Dependencies | No | The paper mentions the use of 'AutoTokenizer' and 'Paged AdamW 8-bit optimizer' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Key hyperparameters, based on our experimental script, included a per-device batch size of 4, 2 gradient accumulation steps, the Paged AdamW 8-bit optimizer, a learning rate of 2e-5, and a cosine learning rate scheduler with a warmup ratio of 0.03. BF16 precision was used, the seed was set to 42, and gradient checkpointing was enabled. |
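The "Fully Mapped Symbols" configuration in the Dataset Splits row systematically remaps both operands and operators so that surface-form memorization no longer helps. The paper does not publish the mapping itself; the sketch below is a hypothetical illustration of the idea, with made-up mapping tables (digits to letters, arithmetic operators to circled symbols).

```python
# Illustrative sketch of "fully mapped symbols": every operand digit and
# every operator in an arithmetic expression is replaced by a fresh symbol.
# The mapping tables are hypothetical, not taken from the paper.

DIGIT_MAP = {d: chr(ord("A") + i) for i, d in enumerate("0123456789")}  # 0->A, 1->B, ...
OP_MAP = {"+": "⊕", "-": "⊖", "*": "⊗", "/": "⊘"}


def remap_expression(expr: str) -> str:
    """Replace each digit and operator character with its mapped symbol."""
    table = {**DIGIT_MAP, **OP_MAP}
    return "".join(table.get(ch, ch) for ch in expr)


print(remap_expression("12+34"))  # BC⊕DE
```

A benchmark built this way forces the model to infer the arithmetic rules from in-context examples rather than recall memorized decimal facts, which is what lets the paper quantify performance degradation under remapping.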
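The hyperparameters in the Experiment Setup row can be collected into a single configuration. The dictionary below uses Hugging Face `transformers`-style keyword names (`per_device_train_batch_size`, `optim`, etc.) as an assumption about the training script, which is not reproduced in the paper; treat it as a hedged sketch of the reported settings, not the authors' actual code.

```python
# Hypothetical reconstruction of the reported fine-tuning hyperparameters,
# using transformers TrainingArguments-style key names (an assumption).
training_config = {
    "per_device_train_batch_size": 4,  # per-device batch size of 4
    "gradient_accumulation_steps": 2,  # 2 gradient accumulation steps
    "optim": "paged_adamw_8bit",       # Paged AdamW 8-bit optimizer
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",     # cosine LR scheduler
    "warmup_ratio": 0.03,
    "bf16": True,                      # BF16 precision
    "seed": 42,
    "gradient_checkpointing": True,
}

# Effective per-device batch size = micro-batch * accumulation steps.
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 8
```

Note the effective per-device batch size of 8 (4 x 2 accumulation), which matters when comparing against runs that use a single larger micro-batch.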