Does learning the right latent variables necessarily improve in-context learning?

Authors: Sarthak Mittal, Eric Elmoznino, Leo Gagnon, Sangnie Bhardwaj, Guillaume Lajoie, Dhanya Sridhar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our goal is to use both synthetic and real tasks that capture the key elements of ICL applications to tease apart the effects of implicit and explicit models on both in-distribution (ID) and out-of-distribution (OOD) generalization. We conduct experiments across a variety of settings that admit task latents, from synthetic regression and classification to reasoning problems.
Researcher Affiliation | Collaboration | (1) Mila Quebec AI Institute, (2) Université de Montréal, (3) Google DeepMind. Correspondence to: Sarthak Mittal <EMAIL>, Guillaume Lajoie <EMAIL>.
Pseudocode | No | The paper describes procedures and architectures using text and diagrams (Figure 1, Figure 10, Figure 16) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | We use the Perturb-seq dataset collected by Norman et al. (2019), where researchers performed several genetic intervention experiments using CRISPR (Gilbert et al., 2014).
Dataset Splits | Yes | For our experiments, the number of context points n is uniformly sampled from 16 to 128 for both training and evaluation. For OOD evaluation, we test two different cases depending on the kind of task. For our synthetic regression and classification tasks, the task latent z and context samples D are sampled from the same distribution as at training time, but the queries x are sampled from a Gaussian distribution with higher (3x) standard deviation. For our reasoning-based problems, we evaluate on task latents z that were not seen at training.
Hardware Specification | Yes | We train most of our models on single RTX8000 NVIDIA GPUs, where it takes roughly 3-6 hours for each experiment to run. Our scaling experiments, on the other hand, often required 1-2 days on single GPUs for training each model.
Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2014) but does not specify version numbers for any software libraries or programming languages.
Experiment Setup | Yes | All the models were trained with a learning rate of 10^-4 using the Adam optimizer (Kingma & Ba, 2014) for 1000 epochs. For our experiments, the number of context points n is uniformly sampled from 16 to 128 for both training and evaluation.
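The sampling protocol quoted in the Dataset Splits and Experiment Setup rows can be sketched as follows. The quoted details are the context size range (uniform from 16 to 128), the 3x-larger query standard deviation for OOD evaluation, and the Adam learning rate and epoch count; the input dimensionality, base standard deviation, and function names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quoted hyperparameters: Adam optimizer, learning rate 1e-4, 1000 epochs.
LEARNING_RATE = 1e-4
EPOCHS = 1000

DIM = 8          # assumed input dimensionality (not stated in the quotes)
BASE_STD = 1.0   # assumed training-time standard deviation (not stated)

def sample_context(n_min=16, n_max=128):
    """Context size n is sampled uniformly from 16 to 128, as quoted."""
    n = int(rng.integers(n_min, n_max + 1))
    return rng.normal(0.0, BASE_STD, size=(n, DIM))

def sample_queries(num_queries, ood=False):
    """ID queries use the training-time std; OOD queries use a 3x larger
    std, matching the quoted protocol for the synthetic tasks."""
    std = 3.0 * BASE_STD if ood else BASE_STD
    return rng.normal(0.0, std, size=(num_queries, DIM))

context = sample_context()
id_queries = sample_queries(64)
ood_queries = sample_queries(64, ood=True)
```

Under this reading, the OOD shift for synthetic tasks affects only the query distribution; the task latent z and the context D are drawn as at training time, while the reasoning tasks instead hold out latents z entirely.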