In-Context Learning and Occam’s Razor
Authors: Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are designed to illustrate the benefits of ICL in terms of fitting simple models that generalize. In Section 3.1, we compare ICL's standard next-token prediction objective to an alternative that minimizes training error alone, rather than prequential code length. Section 3.2 then compares ICL to standard gradient-based learners that minimize training error, such as SGD. |
| Researcher Affiliation | Collaboration | ¹Mila – Quebec AI Institute, ²Université de Montréal, ³NVIDIA. Correspondence to: Eric Elmoznino <EMAIL>, Guillaume Lajoie <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text, without formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our code available at https://github.com/3rdCore/PrequentialCode. |
| Open Datasets | No | In line with similar work studying ICL in a controlled setting (Mahankali et al., 2024; Garg et al., 2022; Akyürek et al., 2023), we use synthetically-generated tasks. Each task consists of a supervised learning dataset D_i = {(x_1, y_1), ..., (x_k, y_k)}, where the labels are a (potentially stochastic) function of the input: y_j = f_i(x_j, ε_j). |
| Dataset Splits | Yes | The training was conducted on a meta-dataset consisting of 10,000 tasks, each with 1,000 data points that serve as context. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 0.0001 and a batch size of 256, without any early stopping. After meta-training, we evaluated the learners on a distinct meta-dataset of 100 tasks, each with 1,000 data points. [...] We used a meta-dataset of 10,000 tasks (with 2,000 data points each) split into training (80%) and validation (20%). |
| Hardware Specification | Yes | All experiments were run on GPUs with at least 32 GB of RAM, and each took less than 1 day to run on a single NVIDIA V100 with all seeds stated in figure captions. |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2015) but does not specify versions for any programming languages or libraries used for implementation. |
| Experiment Setup | Yes | We trained both the Transformer-based meta-learners (with and without bottleneck) for 50 epochs and the Mamba-based meta-learners for 120 epochs. All results were averaged across 5 different random seeds to mitigate the effect of randomness in the pipeline. The training was conducted on a meta-dataset consisting of 10,000 tasks, each with 1,000 data points that serve as context. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 0.0001 and a batch size of 256, without any early stopping. |
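The synthetic task structure quoted above (D_i = {(x_1, y_1), ..., (x_k, y_k)} with y_j = f_i(x_j, ε_j)) can be sketched as follows. This is a minimal illustration, not the paper's released code: the linear task function, Gaussian noise, and all parameter names here are assumptions chosen for concreteness; the paper's actual task families may differ.

```python
import numpy as np

def sample_task(k=1000, dim=8, noise_std=0.1, rng=None):
    """Sample one synthetic supervised task D_i = {(x_1, y_1), ..., (x_k, y_k)}.

    The task function f_i is hypothetical here (a random linear map with
    additive Gaussian noise eps_j), standing in for whatever task family
    the paper actually uses.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(size=dim)                     # task-specific parameters of f_i
    x = rng.normal(size=(k, dim))                # inputs x_j
    eps = rng.normal(scale=noise_std, size=k)    # noise eps_j
    y = x @ w + eps                              # y_j = f_i(x_j, eps_j)
    return x, y

# A meta-dataset is a collection of such tasks; the paper reports
# 10,000 tasks of 1,000 points each for meta-training.
meta_train = [sample_task(k=1000) for _ in range(10)]
```

Each sampled task serves as one context sequence for the meta-learner, with the task identity (here, `w`) hidden from the model.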
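The reported meta-dataset split and optimizer settings (10,000 tasks, 80% train / 20% validation, Adam with η = 0.0001, batch size 256, no early stopping) amount to the following setup. This is a framework-agnostic sketch; the variable names and the use of a permutation over task indices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Split the meta-dataset of 10,000 tasks into 80% training / 20% validation,
# as reported in the paper. Tasks are represented by their indices here.
n_tasks = 10_000
rng = np.random.default_rng(0)
task_ids = rng.permutation(n_tasks)

n_train = int(0.8 * n_tasks)
train_ids, val_ids = task_ids[:n_train], task_ids[n_train:]

# Reported training hyperparameters (Kingma & Ba, 2015 for Adam):
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 256,
    "early_stopping": False,
}
```

Batches of 256 would then be drawn from `train_ids` each step, with `val_ids` held out for model selection during the 50-epoch (Transformer) or 120-epoch (Mamba) runs.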