Inverse Scaling in Test-Time Compute
Authors: Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer... |
| Researcher Affiliation | Collaboration | Aryo Pradipta Gema EMAIL (Anthropic Fellows Program, University of Edinburgh); Alexander Hägele (Anthropic Fellows Program, EPFL); Runjin Chen (Anthropic Fellows Program, University of Texas at Austin); Andy Arditi (Anthropic Fellows Program); Jacob Goldman-Wetzler (Anthropic Fellows Program); Kit Fraser-Taliente (Anthropic Fellows Program); Henry Sleight (Constellation); Linda Petrini (Independent); Julian Michael (Scale AI); Beatrice Alex (University of Edinburgh); Pasquale Minervini (University of Edinburgh, Miniml.AI); Yanda Chen (Anthropic); Joe Benton (Anthropic); Ethan Perez EMAIL (Anthropic) |
| Pseudocode | No | The paper describes experimental setups and methods in paragraph text and figures, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and demo are available at https://safety-research.github.io/inverse-scaling-ttc. ... Our code is available at https://github.com/safety-research/inverse-scaling-ttc |
| Open Datasets | Yes | Our evaluation framework consists of 21 tasks across two categories, as detailed in Table 3. The dataset is publicly available at https://huggingface.co/datasets/inverse-scaling-ttc/inverse-scaling-ttc-main and includes a unique identifier string (i.e., canary string) to prevent this evaluation data from being inadvertently used to train future models. |
| Dataset Splits | Yes | We select 500 students from the original dataset as test instances and evaluate each under three conditions: zero-shot, 8-shot, and 16-shot settings. In the zero-shot setting, we aim to understand how extended reasoning affects models' priors about relationships, testing whether models maintain reasonable assumptions (e.g., study hours matter for grades) or shift to plausible but incorrect features under extended reasoning. In the few-shot settings, we test whether models can learn to focus on genuinely predictive features when provided with few-shot examples, or if they remain susceptible to spurious correlations despite access to ground-truth data. Each few-shot example consists of a student's features paired with their grade, randomly sampled from the remaining students to avoid overlap with the test instance. |
| Hardware Specification | Yes | We use 8 NVIDIA H200s to run the open-weight models. |
| Software Dependencies | No | The paper mentions using 'safety-tooling' and 'vLLM' libraries in Appendix A.3, but only 'safety-tooling' provides a version number (v1.0.0) in its reference entry, while 'vLLM' does not, which is insufficient for multiple versioned software components. |
| Experiment Setup | Yes | For both setups, we use a default temperature of 1.0 for Claude and OpenAI models and the recommended 0.6 for open-weight models. We run multiple trials to ensure robust sampling: three repetitions per budget condition for controlled overthinking experiments and five repetitions for natural overthinking experiments. ... For Claude and open-weight models, we specify an integer denoting the maximum number of tokens the model should use to reason (e.g., 0, 1,024, 2,048, 4,096), while for o-series models, we use their built-in budget levels (i.e., low, medium, high). |
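The experiment-setup row above describes a sweep over reasoning budgets with a fixed number of repetitions and a per-family temperature. A minimal sketch of that trial grid is below; the `trial_configs` helper, the model-name prefixes, and the example model names are hypothetical illustrations, not taken from the paper's released code.

```python
from itertools import product

# Hypothetical sketch of the controlled-overthinking sweep described above:
# three repetitions per reasoning-budget condition, with temperature chosen
# by model family (1.0 for Claude/OpenAI, 0.6 for open-weight models).

REASONING_BUDGETS = [0, 1024, 2048, 4096]     # max reasoning tokens (Claude / open-weight)
O_SERIES_BUDGETS = ["low", "medium", "high"]  # o-series built-in budget levels
REPETITIONS = 3  # controlled overthinking; natural overthinking used 5

def temperature_for(model: str) -> float:
    """Default 1.0 for Claude/OpenAI; recommended 0.6 for open-weight models."""
    return 0.6 if model.startswith("open-weight/") else 1.0

def trial_configs(model: str):
    """Yield one config dict per (budget, repetition) pair for a model."""
    budgets = O_SERIES_BUDGETS if model.startswith("o-series/") else REASONING_BUDGETS
    for budget, rep in product(budgets, range(REPETITIONS)):
        yield {
            "model": model,
            "budget": budget,
            "repetition": rep,
            "temperature": temperature_for(model),
        }

configs = list(trial_configs("open-weight/example-model"))
print(len(configs), configs[0]["temperature"])  # 12 trials at temperature 0.6
```

Under these assumptions, an open-weight model gets 4 budgets × 3 repetitions = 12 trials at temperature 0.6, while an o-series model gets 3 budget levels × 3 repetitions = 9 trials.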