Measuring In-Context Computation Complexity via Hidden State Prediction
Authors: Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence, using both smaller-scale transformers/RNNs and large pre-trained language models, that our hidden-state unpredictability metric correlates with task complexity in various domains. |
| Researcher Affiliation | Academia | IDSIA/USI/SUPSI, Lugano, Switzerland; Stanford University, Stanford, USA; KAUST, Thuwal, Saudi Arabia. |
| Pseudocode | No | The paper describes methods and procedures in narrative text and diagrams (Figure 2), but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available (https://github.com/vincentherrmann/predicting-hidden-states). |
| Open Datasets | Yes | We use the MATH dataset (Hendrycks et al., 2021), which consists of math problems labeled from Level 1 (easy) to Level 5 (hard), along with detailed reasoning solutions. Specifically, we look at the correctness of generated answers to questions from GSM-8k (Cobbe et al., 2021), a dataset of grade-school math problems. The models are trained on a mixture of natural language data from the Slim Pajama dataset, the MATH training set and the GSM-8k training set. |
| Dataset Splits | Yes | Each sequence consists of 10–20 examples from one PFA. To make training more robust, for half of the examples, we randomly perturb 20% of the tokens. During testing we use no perturbation. The models are trained on a mixture of natural language data from the Slim Pajama dataset, the MATH training set and the GSM-8k training set. |
| Hardware Specification | Yes | We also thank NVIDIA Corporation for donating DGX machines as part of the Pioneers of AI Research Award. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify programming languages, deep learning frameworks, or other libraries with version numbers. |
| Experiment Setup | Yes | We train both the Transformer and the LSTM model for 30,000 steps using the Adam optimizer, a batch size of 16 and gradient norm clipping of 1.0. The learning rate is 0.0003, with a 500 step linear warm-up from zero and no decay. They are trained for 10,000 steps using the Adam optimizer, a batch size of 2 and gradient norm clipping of 1.0. The learning rate is 0.0001, with no warm-up or decay. |
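The two training configurations quoted above can be captured in a small sketch. This is not the authors' code; the class and field names are illustrative, and only the hyperparameter values (steps, batch size, gradient clipping, learning rate, warm-up) come from the paper:

```python
# Hedged sketch of the two reported training configurations.
# Names (TrainConfig, lr_at) are hypothetical; values are from the paper.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    steps: int
    batch_size: int
    grad_clip_norm: float
    learning_rate: float
    warmup_steps: int  # linear warm-up from zero; 0 means no warm-up

    def lr_at(self, step: int) -> float:
        """Learning rate at a given step: linear warm-up, then constant (no decay)."""
        if self.warmup_steps and step < self.warmup_steps:
            return self.learning_rate * step / self.warmup_steps
        return self.learning_rate


# Small Transformer / LSTM runs: 30,000 steps, batch 16, lr 3e-4, 500-step warm-up
small_cfg = TrainConfig(steps=30_000, batch_size=16, grad_clip_norm=1.0,
                        learning_rate=3e-4, warmup_steps=500)

# Larger language-model runs: 10,000 steps, batch 2, lr 1e-4, no warm-up
large_cfg = TrainConfig(steps=10_000, batch_size=2, grad_clip_norm=1.0,
                        learning_rate=1e-4, warmup_steps=0)

print(small_cfg.lr_at(250))  # halfway through the linear warm-up
```

The schedule helper simply reflects the paper's description ("linear warm-up from zero and no decay"); gradient clipping and the optimizer choice (Adam) would be applied in the actual training loop, which the paper does not spell out.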