The Missing Alignment Link of In-context Learning on Sequences

Authors: Harshvardhan Agarwal, Sunita Sarawagi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We present a systematic analysis of the ICL capability of LLMs on Seq2Seq tasks using a formal structured language-pair. Our study reveals a critical limitation: except for very short input sequences, ICL fails to achieve consistent learning across all output positions. This exposes a fundamental weakness of modern LLMs: their inability to effectively uncover the alignment between input and output sequences. Consequently, this limitation results in incomplete induction heads, which form the basis for in-context learning of new discrete mappings. To address these limitations, we propose ICA-Tune, a method for focused fine-tuning of an LLM using in-context examples. We present a mechanistic evaluation with two accuracy probes to show how alignment emerges in middle layers of an LLM without any direct supervision. This alignment leads to an abrupt jump in the completeness of the induction heads in higher layers. We show that compared to standard fine-tuning, ICA-Tune enables more sample-efficient learning and generalizes better to OOD instances.
Researcher Affiliation Academia 1Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India. Correspondence to: Harshvardhan Agarwal <EMAIL>, Sunita Sarawagi <EMAIL>.
Pseudocode Yes A pseudocode for our data generator appears in Algorithm 1. Algorithm 1 Sequence Generation Process
Open Source Code Yes 1We release code for data generation and ICA-Tune at https://github.com/draco976/icatune
Open Datasets Yes To study the mechanism of ICL without fear of data contamination, we follow the practice in prior work of evaluating on new synthetic tasks. Our synthetic generator is inspired by real languages. ... A pseudocode for our data generator appears in Algorithm 1. ... 1We release code for data generation and ICA-Tune at https://github.com/draco976/icatune
Dataset Splits Yes For each task characterized as defined above, we generate k + 1 input-output sequence pairs {(x1, y1), . . . (xk+1, yk+1)} using Algorithm 1. We create a prompt using the first k (= 15) as in-context examples, and test instance x as xk+1. ... To ensure comparability between the ICA-Tune and standard fine-tuning setup we generate a fixed dataset D of N training examples for a task τ. Both methods sample from D. We use a batch size of 1 for ICA-Tune and 16 for standard fine-tuning. This choice ensures that both setups train on the same number of examples per training step. We generate a separate set of M = 10 examples for validation. ... For the first validation set, the input sequences x are sampled from the same CFG as the training set, ensuring an in-distribution evaluation. For the OOD validation set, x sequences are generated as random permutations of the x vocabulary, deliberately designed to deviate from the training CFG.
Hardware Specification No No specific hardware details (like GPU models, CPU types, or memory) are provided in the paper. The paper mentions different LLMs (e.g., LLaMA 3, GPT-4o, Claude-3.7-Sonnet, Qwen2.5-3B) but not the hardware used to run experiments with them.
Software Dependencies No The paper mentions using the 'LLaMA 3 model', 'LoRA (Hu et al., 2021)', and the 'Adam optimizer'. However, it does not specify version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other key software components that would be necessary for reproduction.
Experiment Setup Yes For fine-tuning, we employ Low-Rank Adaptation (LoRA), which efficiently adapts pre-trained language models by injecting trainable low-rank updates into specific model parameters. LoRA Hyperparameters We use the following LoRA configuration for the experiments: LoRA Rank (LORA_R): Set to 16. This rank determines the dimensionality of the low-rank decomposition applied to the weight matrices. Scaling Factor (LORA_ALPHA): Set to 8. This hyperparameter controls the scaling of the low-rank updates during training to ensure stability and effective learning. Dropout (LORA_DROPOUT): Set to 0.05 to introduce regularization and prevent overfitting in the low-rank layers. Training Configuration We fine-tune all attention parameters, specifically the Q, K, and V matrices of the transformer. We use a learning rate of 2e-4 for training. We use the Adam optimizer along with a linear decay learning rate scheduler.
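The prompt layout described in the Dataset Splits row (k = 15 in-context pairs followed by the test input x_{k+1}) can be sketched as follows. This is a minimal illustration: the pair separator and the "->" delimiter are assumptions, since the excerpt does not specify the exact prompt format.

```python
# Hypothetical sketch of the ICL prompt construction: the first k pairs
# become demonstrations, and the (k+1)-th input is left for the model
# to complete. Delimiters here are illustrative, not from the paper.
def build_icl_prompt(pairs, test_input, k=15):
    """pairs: list of (input_seq, output_seq) strings; returns one prompt."""
    demos = "\n".join(f"{x} -> {y}" for x, y in pairs[:k])
    return f"{demos}\n{test_input} ->"

# Toy usage with made-up sequence pairs (16 generated, first 15 used as demos).
pairs = [("a b", "b a"), ("c d", "d c")] * 8
prompt = build_icl_prompt(pairs, "e f")
```

The batch-size choice in the quote (1 for ICA-Tune vs. 16 for standard fine-tuning) then equalizes the number of underlying (x, y) examples seen per step, since each ICA-Tune prompt already packs 16 pairs.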
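The LoRA update implied by the reported hyperparameters (rank 16, scaling factor 8, applied to the Q, K, and V matrices) can be written out numerically. The sketch below uses a toy weight matrix in NumPy; the dimension, initialization scale, and random seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of a single LoRA-adapted weight with the reported
# hyperparameters: rank r = 16, scaling alpha = 8. W stands in for one
# frozen attention matrix (Q, K, or V); only A and B would be trained.
d = 64                      # toy hidden dimension (assumption)
rank, alpha = 16, 8         # LORA_R and LORA_ALPHA from the paper

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))            # frozen pre-trained weight
A = rng.standard_normal((rank, d)) * 0.01  # trainable down-projection
B = np.zeros((d, rank))                    # trainable up-projection, zero-init

# Effective weight during fine-tuning: W + (alpha / r) * B @ A.
W_eff = W + (alpha / rank) * B @ A
assert np.allclose(W_eff, W)  # zero-initialized B => no change at step 0
```

Zero-initializing one of the two low-rank factors is the standard LoRA convention: the adapted model starts exactly at the pre-trained weights, and the (alpha / r) factor scales how strongly the learned update perturbs them.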