Measuring In-Context Computation Complexity via Hidden State Prediction
Authors: Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence, using both smaller-scale transformers/RNNs and large pre-trained language models, that our hidden-state unpredictability metric correlates with task complexity in various domains. |
| Researcher Affiliation | Academia | IDSIA/USI/SUPSI, Lugano, Switzerland; Stanford University, Stanford, USA; KAUST, Thuwal, Saudi Arabia. |
| Pseudocode | No | The paper describes methods and procedures in narrative text and diagrams (Figure 2), but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available (https://github.com/vincentherrmann/predicting-hidden-states). |
| Open Datasets | Yes | We use the MATH dataset (Hendrycks et al., 2021), which consists of math problems labeled from Level 1 (easy) to Level 5 (hard), along with detailed reasoning solutions. Specifically, we look at the correctness of generated answers to questions from GSM-8k (Cobbe et al., 2021), a dataset of grade-school math problems. The models are trained on a mixture of natural language data from the Slim Pajama dataset, the MATH training set and the GSM-8k training set. |
| Dataset Splits | Yes | Each sequence consists of 10–20 examples from one PFA. To make training more robust, for half of the examples, we randomly perturb 20% of the tokens. During testing we use no perturbation. The models are trained on a mixture of natural language data from the Slim Pajama dataset, the MATH training set and the GSM-8k training set. |
| Hardware Specification | Yes | We also thank NVIDIA Corporation for donating DGX machines as part of the Pioneers of AI Research Award. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify programming languages, deep learning frameworks, or other libraries with version numbers. |
| Experiment Setup | Yes | We train both the Transformer and the LSTM model for 30,000 steps using the Adam optimizer, a batch size of 16 and gradient norm clipping of 1.0. The learning rate is 0.0003, with a 500 step linear warm-up from zero and no decay. They are trained for 10,000 steps using the Adam optimizer, a batch size of 2 and gradient norm clipping of 1.0. The learning rate is 0.0001, with no warm-up or decay. |
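The two training configurations quoted above can be captured in a small sketch. This is not the authors' code; the class and field names are illustrative, and only the hyperparameter values (steps, batch size, gradient clipping, learning rate, warm-up) come from the paper:

```python
# Hedged sketch of the two reported training configurations.
# Names (TrainConfig, lr_at) are hypothetical; values are from the paper.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    steps: int
    batch_size: int
    grad_clip_norm: float
    learning_rate: float
    warmup_steps: int  # linear warm-up from zero; 0 means no warm-up

    def lr_at(self, step: int) -> float:
        """Learning rate at a given step: linear warm-up, then constant (no decay)."""
        if self.warmup_steps and step < self.warmup_steps:
            return self.learning_rate * step / self.warmup_steps
        return self.learning_rate


# Small Transformer / LSTM runs: 30,000 steps, batch 16, lr 3e-4, 500-step warm-up
small_cfg = TrainConfig(steps=30_000, batch_size=16, grad_clip_norm=1.0,
                        learning_rate=3e-4, warmup_steps=500)

# Larger language-model runs: 10,000 steps, batch 2, lr 1e-4, no warm-up
large_cfg = TrainConfig(steps=10_000, batch_size=2, grad_clip_norm=1.0,
                        learning_rate=1e-4, warmup_steps=0)

print(small_cfg.lr_at(250))  # halfway through the linear warm-up
```

The schedule helper simply reflects the paper's description ("linear warm-up from zero and no decay"); gradient clipping and the optimizer choice (Adam) would be applied in the actual training loop, which the paper does not spell out.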