Competition Dynamics Shape Algorithmic Phases of In-Context Learning

Authors: Core Francisco Park, Ekdeep Singh Lubana, Hidenori Tanaka

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Systematically varying these factors, we show our proposed task turns out to be extremely rich, reproducing most known phenomenology of ICL and hence offering a unified setup for studying the concept. Building on this, we start to deconstruct how a model trained on our task performs ICL, finding that there exist (at least) four broad algorithms that can explain the model's behavior. These algorithms combine a fuzzy retrieval vs. inference approach with either unigram or bigram statistics of the context and, as we show, engage in competition dynamics with each other to dictate model behavior. Interestingly, we find the precise experimental conditions (e.g., amount of training and context size) decide which algorithm wins the competition, hence yielding several phases in the model's ability to perform ICL: varying experimental conditions elicits different algorithmic behaviors (at times rather abruptly), making it difficult for an understanding of ICL derived in one configuration to predict model behavior in another.
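The unigram-vs-bigram distinction in the response above can be made concrete with a short sketch (our own illustration, not the authors' code): a unigram strategy predicts the next state from in-context state frequencies, while a bigram strategy predicts from observed transitions out of the current state.

```python
import numpy as np

def unigram_predictor(context, k):
    # Next-state distribution from in-context unigram (frequency) statistics
    counts = np.bincount(context, minlength=k).astype(float)
    return counts / counts.sum()

def bigram_predictor(context, k):
    # Next-state distribution from in-context bigram (transition) statistics,
    # conditioned on the last state in the context
    counts = np.zeros((k, k))
    for a, b in zip(context[:-1], context[1:]):
        counts[a, b] += 1
    row = counts[context[-1]]
    if row.sum() == 0:  # current state never seen as a source: fall back to uniform
        return np.ones(k) / k
    return row / row.sum()
```

On a context like [0, 1, 0, 1, 0], the two strategies disagree sharply: the unigram predictor favors state 0 (it is most frequent), while the bigram predictor puts all mass on state 1 (every observed transition out of state 0 goes to 1).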
Researcher Affiliation Collaboration (1) CBS-NTT Program in Physics of Intelligence, Harvard University; (2) Department of Physics, Harvard University; (3) Physics & Informatics Laboratories, NTT Research, Inc.; (4) EECS Department, University of Michigan, Ann Arbor
Pseudocode Yes To make the dataset clear we attach a minimal PyTorch (Paszke et al., 2019) implementation. We generated this code with the help of GPT-4o (OpenAI, 2024).

import torch
from torch.utils.data import IterableDataset
import numpy as np

class MarkovChainDataset(IterableDataset):
    def __init__(self, k, l, n, seed=None):
        self.k = k  # Number of states
        self.l = l  # Output sequence length
        self.n = n  # Number of transition matrices
        self.seed = seed  # Seed for reproducibility

        if seed is not None:
            np.random.seed(seed)
            torch.manual_seed(seed)

        # Generate n transition matrices, each with k states
        self.transition_matrices = []
        for _ in range(n):
            matrix = np.array([np.random.dirichlet([1] * k) for _ in range(k)])
            self.transition_matrices.append(matrix)

        # Prior distribution over transition matrices
        self.prior = np.random.dirichlet([1] * n)

    def get_stationary_distribution(self, matrix):
        # Compute the stationary distribution of the transition matrix
        eigvals, eigvecs = np.linalg.eig(matrix.T)
        stationary = np.real(eigvecs[:, np.isclose(eigvals, 1)])
        stationary = stationary[:, 0]
        stationary /= stationary.sum()
        return stationary

    def sample_chain(self, transition_matrix):
        # Sample the first state from the stationary distribution
        stationary_distribution = self.get_stationary_distribution(transition_matrix)
        first_state = np.random.choice(self.k, p=stationary_distribution)

        # Generate the rest of the sequence
        sequence = [first_state]
        for _ in range(1, self.l):
            current_state = sequence[-1]
            next_state = np.random.choice(self.k, p=transition_matrix[current_state])
            sequence.append(next_state)

        return sequence

    def __iter__(self):
        while True:
            # Choose a transition matrix based on the prior
            matrix_index = np.random.choice(self.n, p=self.prior)
            chosen_matrix = self.transition_matrices[matrix_index]

            # Generate a sequence using the chosen transition matrix
            sequence = self.sample_chain(chosen_matrix)
            yield torch.tensor(sequence)

# Example usage:
# k = 3 (states), l = 10 (sequence length), n = 5 (transition matrices)
dataset = MarkovChainDataset(k=3, l=10, n=5, seed=42)
iterator = iter(dataset)

# Get a sample sequence
sample_sequence = next(iterator)
print(sample_sequence)  # a length-10 tensor of states in {0, 1, 2}

Listing 1: Markov Mixtures Data Generating Process
Open Source Code Yes All code used to run the experiments and analysis are available at https://github.com/cfpark00/markov-mixtures.
Open Datasets Yes We propose a novel sequence modeling task that involves learning to simulate a finite mixture of Markov chains. To make the dataset clear we attach a minimal PyTorch (Paszke et al., 2019) implementation (Listing 1: Markov Mixtures Data Generating Process). We generated this code with the help of GPT-4o (OpenAI, 2024).
Dataset Splits No The sampling process is repeated at every step of training, i.e., the model is unlikely to see the same sequence twice during training (a.k.a. online training). Evaluation. We evaluate trained models on In-Distribution (ID) and Out-Of-Distribution (OOD) chains. For ID evaluations, we select the transition matrix T from the training set T_train; for OOD evaluations, we sample a novel T using a Dirichlet prior.
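The ID/OOD evaluation protocol described here can be sketched as follows (variable names are ours; it reuses the symmetric Dirichlet(1) row prior from Listing 1):

```python
import numpy as np

def sample_transition_matrix(k, rng):
    # Each row is a next-state distribution drawn from a symmetric Dirichlet(1) prior
    return np.stack([rng.dirichlet(np.ones(k)) for _ in range(k)])

rng = np.random.default_rng(0)
T_train = [sample_transition_matrix(3, rng) for _ in range(5)]  # fixed training set

T_id = T_train[rng.integers(len(T_train))]  # ID eval: a matrix seen during training
T_ood = sample_transition_matrix(3, rng)    # OOD eval: a fresh draw from the prior
```

Because the OOD matrix is drawn from the same prior as the training set, OOD sequences are statistically similar but never reduce to retrieving a memorized chain.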
Hardware Specification Yes We trained our models on NVIDIA A100 80 GB GPUs, running 5 experiments in parallel on the same GPU, and hence yielding a wallclock time of 3.5 hours.
Software Dependencies No We adapted code from nanoGPT (Karpathy, 2022), and implemented Rotary Positional Embedding (RoPE) (Su et al., 2023) instead of the default learned positional embedding, which significantly delayed the emergence of Bi-Inf. To make the dataset clear we attach a minimal PyTorch (Paszke et al., 2019) implementation. We generated this code with the help of GPT-4o (OpenAI, 2024).
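For reference, the rotation at the heart of RoPE can be sketched in a few lines of NumPy (an illustrative sketch of the standard formulation, not the repository's implementation; in practice it is applied to per-head query/key vectors inside attention, and the dimension-pairing convention varies across implementations):

```python
import numpy as np

def rope(x, base=10000.0):
    # Apply a rotary positional embedding to x of shape (seq_len, d), d even.
    # Dimension pairs (i, i + d/2) are rotated by position-dependent angles.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

Because positions enter only through rotation angles, the dot product of two rotated vectors depends on their relative offset, which is what distinguishes RoPE from a learned absolute positional embedding.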
Experiment Setup Yes We used the AdamW optimizer (Loshchilov & Hutter, 2019) with learning rate 6 × 10^-4. We kept the estimated FLOPs constant at 1 × 10^16, where we estimate compute as 6DN (D denotes the number of tokens seen during training and N denotes the number of model parameters). Using a batch size of 128 resulted in a training run of 107,978 steps.
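The step count follows from the 6DN estimate: a FLOP budget C affords D = C / (6N) tokens, and the number of steps is D / (batch_size × sequence_length). A hypothetical sketch (the excerpt does not state N or the sequence length, so the values below are placeholders, not the paper's configuration):

```python
def training_steps(flop_budget, n_params, batch_size, seq_len):
    # Tokens affordable under the compute estimate FLOPs ~= 6 * D * N
    tokens = flop_budget / (6 * n_params)
    # Each optimizer step consumes batch_size * seq_len tokens
    return int(tokens // (batch_size * seq_len))

# Placeholder model size and context length (not given in the excerpt):
print(training_steps(1e16, 1_000_000, 128, 256))
```

Holding C fixed while varying N trades model size against tokens seen, which is what makes the fixed-FLOPs comparison across runs meaningful.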