Inverse Scaling in Test-Time Compute
Authors: Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer... |
| Researcher Affiliation | Collaboration | Aryo Pradipta Gema EMAIL (Anthropic Fellows Program, University of Edinburgh); Alexander Hägele (Anthropic Fellows Program, EPFL); Runjin Chen (Anthropic Fellows Program, University of Texas at Austin); Andy Arditi (Anthropic Fellows Program); Jacob Goldman-Wetzler (Anthropic Fellows Program); Kit Fraser-Taliente (Anthropic Fellows Program); Henry Sleight (Constellation); Linda Petrini (Independent); Julian Michael (Scale AI); Beatrice Alex (University of Edinburgh); Pasquale Minervini (University of Edinburgh, Miniml.AI); Yanda Chen (Anthropic); Joe Benton (Anthropic); Ethan Perez EMAIL (Anthropic) |
| Pseudocode | No | The paper describes experimental setups and methods in paragraph text and figures, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and demo are available at https://safety-research.github.io/inverse-scaling-ttc. ... Our code is available at https://github.com/safety-research/inverse-scaling-ttc |
| Open Datasets | Yes | Our evaluation framework consists of 21 tasks across two categories, as detailed in Table 3. The dataset is publicly available at https://huggingface.co/datasets/inverse-scaling-ttc/inverse-scaling-ttc-main and includes a unique identifier string (i.e., canary string) to prevent this evaluation data from being inadvertently used to train future models. |
| Dataset Splits | Yes | We select 500 students from the original dataset as test instances and evaluate each under three conditions: zero-shot, 8-shot, and 16-shot settings. In the zero-shot setting, we aim to understand how extended reasoning affects models' priors about relationships, testing whether models maintain reasonable assumptions (e.g., study hours matter for grades) or shift to plausible but incorrect features under extended reasoning. In the few-shot settings, we test whether models can learn to focus on genuinely predictive features when provided with few-shot examples, or if they remain susceptible to spurious correlations despite access to ground-truth data. Each few-shot example consists of a student's features paired with their grade, randomly sampled from the remaining students to avoid overlap with the test instance. |
| Hardware Specification | Yes | We use 8 NVIDIA H200s to run the open-weight models. |
| Software Dependencies | No | The paper mentions using 'safety-tooling' and 'vLLM' libraries in Appendix A.3, but only 'safety-tooling' provides a version number (v1.0.0) in its reference entry, while 'vLLM' does not, which is insufficient for multiple versioned software components. |
| Experiment Setup | Yes | For both setups, we use a default temperature of 1.0 for Claude and OpenAI models and the recommended 0.6 for open-weight models. We run multiple trials to ensure robust sampling: three repetitions per budget condition for controlled overthinking experiments and five repetitions for natural overthinking experiments. ... For Claude and open-weight models, we specify an integer denoting the maximum number of tokens the model should use to reason (e.g., 0, 1,024, 2,048, 4,096), while for o-series models, we use their built-in budget levels (i.e., low, medium, high). |
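The experiment-setup row above describes a sweep over reasoning budgets with a fixed number of repetitions and a per-family temperature. A minimal sketch of that trial grid is below; the `trial_configs` helper, the model-name prefixes, and the example model names are hypothetical illustrations, not taken from the paper's released code.

```python
from itertools import product

# Hypothetical sketch of the controlled-overthinking sweep described above:
# three repetitions per reasoning-budget condition, with temperature chosen
# by model family (1.0 for Claude/OpenAI, 0.6 for open-weight models).

REASONING_BUDGETS = [0, 1024, 2048, 4096]     # max reasoning tokens (Claude / open-weight)
O_SERIES_BUDGETS = ["low", "medium", "high"]  # o-series built-in budget levels
REPETITIONS = 3  # controlled overthinking; natural overthinking used 5

def temperature_for(model: str) -> float:
    """Default 1.0 for Claude/OpenAI; recommended 0.6 for open-weight models."""
    return 0.6 if model.startswith("open-weight/") else 1.0

def trial_configs(model: str):
    """Yield one config dict per (budget, repetition) pair for a model."""
    budgets = O_SERIES_BUDGETS if model.startswith("o-series/") else REASONING_BUDGETS
    for budget, rep in product(budgets, range(REPETITIONS)):
        yield {
            "model": model,
            "budget": budget,
            "repetition": rep,
            "temperature": temperature_for(model),
        }

configs = list(trial_configs("open-weight/example-model"))
print(len(configs), configs[0]["temperature"])  # 12 trials at temperature 0.6
```

Under these assumptions, an open-weight model gets 4 budgets × 3 repetitions = 12 trials at temperature 0.6, while an o-series model gets 3 budget levels × 3 repetitions = 9 trials.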