Does learning the right latent variables necessarily improve in-context learning?
Authors: Sarthak Mittal, Eric Elmoznino, Leo Gagnon, Sangnie Bhardwaj, Guillaume Lajoie, Dhanya Sridhar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our goal is to use both synthetic and real tasks that capture the key elements of ICL applications to tease apart the effects of implicit and explicit models on both in-distribution (ID) and out-of-distribution (OOD) generalization. We conduct experiments across a variety of settings that admit task latents, from synthetic regression and classification to reasoning problems. |
| Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute 2Université de Montréal 3Google DeepMind. Correspondence to: Sarthak Mittal <EMAIL>, Guillaume Lajoie <EMAIL>. |
| Pseudocode | No | The paper describes procedures and architectures using text and diagrams (Figure 1, Figure 10, Figure 16) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | We use Perturb-seq dataset collected by Norman et al. (2019) where researchers performed several genetic intervention experiments using CRISPR (Gilbert et al., 2014). |
| Dataset Splits | Yes | For our experiments, the number of context points n is uniformly sampled from 16 to 128 for both training and evaluation. For OOD evaluation, we test two different cases depending on the kind of task. For our synthetic regression and classification tasks, the task latent z and context samples D are sampled from the same distribution as at training time, but the queries x are sampled from a Gaussian distribution with higher (3×) standard deviation. For our reasoning-based problems, we evaluate on task latents z that were not seen at training. |
| Hardware Specification | Yes | We train most of our models on single RTX8000 NVIDIA GPUs, where it takes roughly 3-6 hours for each experiment to run. Our scaling experiments on the other hand often required 1-2 days on single GPUs for training each model. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2014) but does not specify version numbers for any software libraries or programming languages. |
| Experiment Setup | Yes | All the models were trained with a learning rate of 10⁻⁴ using the Adam optimizer (Kingma & Ba, 2014) for 1000 epochs. For our experiments, the number of context points n is uniformly sampled from 16 to 128 for both training and evaluation. |
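
The sampling protocol quoted above (context size n ~ U[16, 128]; OOD queries drawn with 3× the training standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' code: the standard-normal training distribution, input dimension, and function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n_min=16, n_max=128, dim=2, ood=False):
    """Illustrative sketch of the paper's described sampling:
    context size n is uniform over [16, 128]; OOD queries are drawn
    with 3x the (assumed standard-normal) training std."""
    n = int(rng.integers(n_min, n_max + 1))      # number of context points
    context_x = rng.standard_normal((n, dim))    # in-distribution context
    query_std = 3.0 if ood else 1.0              # 3x std for OOD queries
    query_x = query_std * rng.standard_normal((1, dim))
    return context_x, query_x

ctx, query = sample_task(ood=True)
```

A training loop under this setup would simply resample `sample_task()` each step and evaluate with `ood=True` for the synthetic-task OOD condition; the reasoning-task OOD condition instead holds out task latents z, which this sketch does not model.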