Does learning the right latent variables necessarily improve in-context learning?
Authors: Sarthak Mittal, Eric Elmoznino, Leo Gagnon, Sangnie Bhardwaj, Guillaume Lajoie, Dhanya Sridhar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our goal is to use both synthetic and real tasks that capture the key elements of ICL applications to tease apart the effects of implicit and explicit models on both in-distribution (ID) and out-of-distribution (OOD) generalization. We conduct experiments across a variety of settings that admit task latents, from synthetic regression and classification to reasoning problems. |
| Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute 2Université de Montréal 3Google DeepMind. Correspondence to: Sarthak Mittal <EMAIL>, Guillaume Lajoie <EMAIL>. |
| Pseudocode | No | The paper describes procedures and architectures using text and diagrams (Figure 1, Figure 10, Figure 16) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | We use Perturb-seq dataset collected by Norman et al. (2019) where researchers performed several genetic intervention experiments using CRISPR (Gilbert et al., 2014). |
| Dataset Splits | Yes | For our experiments, the number of context points n is uniformly sampled from 16 to 128 for both training and evaluation. For OOD evaluation, we test two different cases depending on the kind of task. For our synthetic regression and classification tasks, the task latent z and context samples D are sampled from the same distribution as at training time, but the queries x are sampled from a Gaussian distribution with higher (3×) standard deviation. For our reasoning-based problems, we evaluate on task latents z that were not seen at training. |
| Hardware Specification | Yes | We train most of our models on single RTX8000 NVIDIA GPUs, where it takes roughly 3-6 hours for each experiment to run. Our scaling experiments on the other hand often required 1-2 days on single GPUs for training each model. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2014) but does not specify version numbers for any software libraries or programming languages. |
| Experiment Setup | Yes | All the models were trained with a learning rate of 10⁻⁴ using the Adam optimizer (Kingma & Ba, 2014) for 1000 epochs. For our experiments, the number of context points n is uniformly sampled from 16 to 128 for both training and evaluation. |
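
The sampling protocol quoted above (context size n ~ U[16, 128]; OOD queries drawn with 3× the training standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' code: the standard-normal training distribution, input dimension, and function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n_min=16, n_max=128, dim=2, ood=False):
    """Illustrative sketch of the paper's described sampling:
    context size n is uniform over [16, 128]; OOD queries are drawn
    with 3x the (assumed standard-normal) training std."""
    n = int(rng.integers(n_min, n_max + 1))      # number of context points
    context_x = rng.standard_normal((n, dim))    # in-distribution context
    query_std = 3.0 if ood else 1.0              # 3x std for OOD queries
    query_x = query_std * rng.standard_normal((1, dim))
    return context_x, query_x

ctx, query = sample_task(ood=True)
```

A training loop under this setup would simply resample `sample_task()` each step and evaluate with `ood=True` for the synthetic-task OOD condition; the reasoning-task OOD condition instead holds out task latents z, which this sketch does not model.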