In-Context Learning and Occam’s Razor

Authors: Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments are designed to illustrate the benefits of ICL in terms of fitting simple models that generalize. In Section 3.1, we compare ICL's standard next-token prediction objective to an alternative that minimizes training error alone, rather than prequential code length. Section 3.2 then compares ICL to standard gradient-based learners that minimize training error, such as SGD.
Researcher Affiliation Collaboration 1Mila - Quebec AI Institute 2Université de Montréal 3NVIDIA. Correspondence to: Eric Elmoznino <EMAIL>, Guillaume Lajoie <EMAIL>.
Pseudocode No The paper describes methods and processes in narrative text, without formal pseudocode or algorithm blocks.
Open Source Code Yes We make our code available at https://github.com/3rdCore/PrequentialCode.
Open Datasets No In line with similar work studying ICL in a controlled setting (Mahankali et al., 2024; Garg et al., 2022; Akyürek et al., 2023), we use synthetically-generated tasks. Each task consists of a supervised learning dataset D_i = {(x_1, y_1), ..., (x_k, y_k)}, where the labels are a (potentially stochastic) function of the input, y_j = f_i(x_j, ε_j).
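The task structure quoted above can be sketched as follows. This is a minimal illustration, not the paper's generator: the choice of f_i as a random linear map, the input dimension, and the Gaussian noise are all assumptions; the paper's task families may differ.

```python
import numpy as np

def make_task(k=1000, d=8, noise=0.1, rng=None):
    """Generate one synthetic task D_i = {(x_1, y_1), ..., (x_k, y_k)}
    with stochastic labels y_j = f_i(x_j, eps_j).

    Here f_i is a random linear function parameterized by w (a hypothetical
    stand-in for the paper's task-specific function)."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                  # task-specific parameters of f_i
    x = rng.normal(size=(k, d))             # inputs x_1, ..., x_k
    eps = rng.normal(scale=noise, size=k)   # per-example label noise eps_j
    y = x @ w + eps                         # labels y_j = f_i(x_j, eps_j)
    return x, y

# One small task: 5 context points in 3 dimensions
x, y = make_task(k=5, d=3, rng=np.random.default_rng(0))
```

A meta-dataset is then just a collection of such tasks, each drawn with its own w, which is what the meta-learners below are trained across.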
Dataset Splits Yes The training was conducted on a meta-dataset consisting of 10,000 tasks, each with 1,000 data points that serve as context. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 0.0001 and a batch size of 256, without any early stopping. After meta-training, we evaluated the learners on a distinct meta-dataset of 100 tasks, each with 1,000 data points. [...] We used a meta-dataset of 10,000 tasks (with 2,000 data points each) split into training (80%) and validation (20%).
Hardware Specification Yes All experiments were run on GPUs with at least 32 GB of RAM, and each took less than 1 day to run on a single NVIDIA V100 with all seeds stated in figure captions.
Software Dependencies No The paper mentions the Adam optimizer (Kingma & Ba, 2015) but does not specify versions for any programming languages or libraries used for implementation.
Experiment Setup Yes We trained both the Transformer-based meta-learners (with and without bottleneck) for 50 epochs and the Mamba-based meta-learners for 120 epochs. All results were averaged across 5 different random seeds to mitigate the effect of randomness in the pipeline. The training was conducted on a meta-dataset consisting of 10,000 tasks, each with 1,000 data points that serve as context. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 0.0001 and a batch size of 256, without any early stopping.
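For reference, the hyperparameters stated across the setup quotes above can be collected into a single configuration. Only the values quoted from the paper appear here; the dictionary itself and its key names are an illustrative convention, not part of the released code.

```python
# Meta-training configuration assembled from the paper's stated setup.
# Key names are hypothetical; values are quoted from the paper.
meta_train_config = {
    "num_tasks": 10_000,         # tasks in the training meta-dataset
    "points_per_task": 1_000,    # context data points per task
    "optimizer": "Adam",         # Kingma & Ba (2015)
    "learning_rate": 1e-4,       # eta = 0.0001
    "batch_size": 256,
    "early_stopping": False,
    "epochs": {
        "transformer": 50,       # with and without bottleneck
        "mamba": 120,
    },
    "num_seeds": 5,              # results averaged across random seeds
    "eval_num_tasks": 100,       # distinct evaluation meta-dataset
}
```

Collecting the scattered quotes this way makes it easier to check a reimplementation against the reported setup at a glance.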