Measuring In-Context Computation Complexity via Hidden State Prediction

Authors: Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber

ICML 2025

Reproducibility

| Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We provide empirical evidence, using both smaller-scale transformers/RNNs and large pre-trained language models, that our hidden-state unpredictability metric correlates with task complexity in various domains." |
| Researcher Affiliation | Academia | "1: IDSIA/USI/SUPSI, Lugano, Switzerland; 2: Stanford University, Stanford, USA; 3: KAUST, Thuwal, Saudi Arabia." |
| Pseudocode | No | The paper describes its methods and procedures in narrative text and diagrams (Figure 2), but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Our code is publicly available (https://github.com/vincentherrmann/predicting-hidden-states)." |
| Open Datasets | Yes | "We use the MATH dataset (Hendrycks et al., 2021), which consists of math problems labeled from Level 1 (easy) to Level 5 (hard), along with detailed reasoning solutions." "Specifically, we look at the correctness of generated answers to questions from GSM-8k (Cobbe et al., 2021), a dataset of grade-school math problems." "The models are trained on a mixture of natural language data from the Slim Pajama dataset, the MATH training set and the GSM-8k training set." |
| Dataset Splits | Yes | "Each sequence consists of 10–20 examples from one PFA. To make training more robust, for half of the examples, we randomly perturb 20% of the tokens. During testing we use no perturbation." "The models are trained on a mixture of natural language data from the Slim Pajama dataset, the MATH training set and the GSM-8k training set." |
| Hardware Specification | Yes | "We also thank NVIDIA Corporation for donating DGX machines as part of the Pioneers of AI Research Award." |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify programming languages, deep learning frameworks, or other libraries with version numbers. |
| Experiment Setup | Yes | "We train both the Transformer and the LSTM model for 30,000 steps using the Adam optimizer, a batch size of 16 and gradient norm clipping of 1.0. The learning rate is 0.0003, with a 500-step linear warm-up from zero and no decay." "They are trained for 10,000 steps using the Adam optimizer, a batch size of 2 and gradient norm clipping of 1.0. The learning rate is 0.0001, with no warm-up or decay." |
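The Dataset Splits excerpt says that 20% of the tokens are randomly perturbed for half of the training examples. A minimal sketch of one plausible reading is below; the quoted text does not say how tokens are perturbed, so substitution with random vocabulary ids is an assumption, and `perturb_tokens` / `maybe_perturb` are hypothetical names, not functions from the paper's codebase:

```python
import random

def perturb_tokens(tokens, vocab_size, frac=0.2, rng=random):
    """Replace a random `frac` of the tokens with random vocabulary ids.

    Hypothetical reconstruction: the excerpt only says tokens are
    "randomly perturbed"; substitution is one plausible interpretation.
    """
    tokens = list(tokens)
    n_perturb = int(len(tokens) * frac)
    for i in rng.sample(range(len(tokens)), n_perturb):
        tokens[i] = rng.randrange(vocab_size)
    return tokens

def maybe_perturb(example, vocab_size, rng=random):
    # Perturb only half of the training examples, as the excerpt describes.
    if rng.random() < 0.5:
        return perturb_tokens(example, vocab_size, rng=rng)
    return example
```

At test time the excerpt states no perturbation is applied, so `maybe_perturb` would only be used when building training sequences.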
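The quoted Transformer/LSTM schedule (Adam, learning rate 0.0003, batch size 16, gradient-norm clipping at 1.0, a 500-step linear warm-up from zero, no decay, 30,000 steps) can be sketched in PyTorch as follows. The model, data, and loss here are placeholders, not the paper's architecture or objective:

```python
import torch

# Placeholder model standing in for the paper's Transformer/LSTM.
model = torch.nn.Linear(16, 16)

# Quoted settings: Adam, lr 0.0003, 500-step linear warm-up from zero,
# no decay, gradient-norm clipping at 1.0, 30,000 training steps.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, step / 500)
)

for step in range(30_000):
    batch = torch.randn(16, 16)        # placeholder batch of size 16
    loss = model(batch).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                   # advances the linear warm-up
```

The second quoted configuration (10,000 steps, batch size 2, learning rate 0.0001, no warm-up) would drop the `LambdaLR` scheduler and change the corresponding constants.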