In-Context Learning and Occam’s Razor

Authors: Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments are designed to illustrate the benefits of ICL in terms of fitting simple models that generalize. In Section 3.1, we compare ICL's standard next-token prediction objective to an alternative that minimizes training error alone, rather than prequential code length. Section 3.2 then compares ICL to standard gradient-based learners that minimize training error, such as SGD.
Researcher Affiliation Collaboration 1Mila - Quebec AI Institute 2Université de Montréal 3NVIDIA. Correspondence to: Eric Elmoznino <EMAIL>, Guillaume Lajoie <EMAIL>.
Pseudocode No The paper describes methods and processes in narrative text, without formal pseudocode or algorithm blocks.
Open Source Code Yes We make our code available at https://github.com/3rdCore/PrequentialCode.
Open Datasets No In line with similar work studying ICL in a controlled setting (Mahankali et al., 2024; Garg et al., 2022; Akyürek et al., 2023), we use synthetically-generated tasks. Each task consists of a supervised learning dataset D_i = {(x_1, y_1), ..., (x_k, y_k)}, where the labels are a (potentially stochastic) function of the input, y_j = f_i(x_j, ε_j).
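The task structure quoted above can be sketched as follows. This is a minimal illustration, not the paper's generator: the choice of f_i as a random linear map, the input dimension, and the Gaussian noise are all assumptions; the paper's task families may differ.

```python
import numpy as np

def make_task(k=1000, d=8, noise=0.1, rng=None):
    """Generate one synthetic task D_i = {(x_1, y_1), ..., (x_k, y_k)}
    with stochastic labels y_j = f_i(x_j, eps_j).

    Here f_i is a random linear function parameterized by w (a hypothetical
    stand-in for the paper's task-specific function)."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                  # task-specific parameters of f_i
    x = rng.normal(size=(k, d))             # inputs x_1, ..., x_k
    eps = rng.normal(scale=noise, size=k)   # per-example label noise eps_j
    y = x @ w + eps                         # labels y_j = f_i(x_j, eps_j)
    return x, y

# One small task: 5 context points in 3 dimensions
x, y = make_task(k=5, d=3, rng=np.random.default_rng(0))
```

A meta-dataset is then just a collection of such tasks, each drawn with its own w, which is what the meta-learners below are trained across.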
Dataset Splits Yes The training was conducted on a meta-dataset consisting of 10,000 tasks, each with 1,000 data points that serve as context. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 0.0001 and a batch size of 256, without any early stopping. After meta-training, we evaluated the learners on a distinct meta-dataset of 100 tasks, each with 1,000 data points. [...] We used a meta-dataset of 10,000 tasks (with 2,000 data points each) split into training (80%) and validation (20%).
Hardware Specification Yes All experiments were run on GPUs with at least 32 GB of RAM, and each took less than 1 day to run on a single NVIDIA V100 with all seeds stated in figure captions.
Software Dependencies No The paper mentions the Adam optimizer (Kingma & Ba, 2015) but does not specify versions for any programming languages or libraries used for implementation.
Experiment Setup Yes We trained both the Transformer-based meta-learners (with and without bottleneck) for 50 epochs and the Mamba-based meta-learners for 120 epochs. All results were averaged across 5 different random seeds to mitigate the effect of randomness in the pipeline. The training was conducted on a meta-dataset consisting of 10,000 tasks, each with 1,000 data points that serve as context. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of η = 0.0001 and a batch size of 256, without any early stopping.
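For reference, the hyperparameters stated across the setup quotes above can be collected into a single configuration. Only the values quoted from the paper appear here; the dictionary itself and its key names are an illustrative convention, not part of the released code.

```python
# Meta-training configuration assembled from the paper's stated setup.
# Key names are hypothetical; values are quoted from the paper.
meta_train_config = {
    "num_tasks": 10_000,         # tasks in the training meta-dataset
    "points_per_task": 1_000,    # context data points per task
    "optimizer": "Adam",         # Kingma & Ba (2015)
    "learning_rate": 1e-4,       # eta = 0.0001
    "batch_size": 256,
    "early_stopping": False,
    "epochs": {
        "transformer": 50,       # with and without bottleneck
        "mamba": 120,
    },
    "num_seeds": 5,              # results averaged across random seeds
    "eval_num_tasks": 100,       # distinct evaluation meta-dataset
}
```

Collecting the scattered quotes this way makes it easier to check a reimplementation against the reported setup at a glance.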