The Missing Alignment Link of In-context Learning on Sequences

Authors: Harshvardhan Agarwal, Sunita Sarawagi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We present a systematic analysis of the ICL capability of LLMs on Seq2Seq tasks using a formal structured language-pair. Our study reveals a critical limitation: except for very short input sequences, ICL fails to achieve consistent learning across all output positions. This exposes a fundamental weakness of modern LLMs: their inability to effectively uncover the alignment between input and output sequences. Consequently, this limitation results in incomplete induction heads, which form the basis for in-context learning of new discrete mappings. To address these limitations, we propose ICA-Tune, a method for focused fine-tuning of an LLM using in-context examples. We present a mechanistic evaluation with two accuracy probes to show how alignment emerges in middle layers of an LLM without any direct supervision. This alignment leads to an abrupt jump in the completeness of the induction heads in higher layers. We show that compared to standard fine-tuning, ICA-Tune enables more sample-efficient learning and generalizes better to OOD instances.
Researcher Affiliation Academia 1Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India. Correspondence to: Harshvardhan Agarwal <EMAIL>, Sunita Sarawagi <EMAIL>.
Pseudocode Yes A pseudocode for our data generator appears in Algorithm 1. Algorithm 1 Sequence Generation Process
Open Source Code Yes 1We release code for data generation and ICA-Tune at https://github.com/draco976/icatune
Open Datasets Yes To study the mechanism of ICL without fear of data contamination, we follow the practice in prior work of evaluating on new synthetic tasks. Our synthetic generator is inspired by real languages. ... A pseudocode for our data generator appears in Algorithm 1. ... 1We release code for data generation and ICA-Tune at https://github.com/draco976/icatune
Dataset Splits Yes For each task characterized as defined above, we generate k + 1 input-output sequence pairs {(x1, y1), . . . (xk+1, yk+1)} using Algorithm 1. We create a prompt using the first k (= 15) as in-context examples, and test instance x as xk+1. ... To ensure comparability between the ICA-Tune and standard fine-tuning setup we generate a fixed dataset D of N training examples for a task τ. Both methods sample from D. We use a batch size of 1 for ICA-Tune and 16 for standard fine-tuning. This choice ensures that both setups train on the same number of examples per training step. We generate a separate set of M = 10 examples for validation. ... For the first validation set, the input sequences x are sampled from the same CFG as the training set, ensuring an in-distribution evaluation. For the OOD validation set, x sequences are generated as random permutations of the x vocabulary, deliberately designed to deviate from the training CFG.
Hardware Specification No No specific hardware details (like GPU models, CPU types, or memory) are provided in the paper. The paper mentions different LLMs (e.g., LLaMA 3, GPT-4o, Claude-3.7-Sonnet, Qwen2.5-3B) but not the hardware used to run experiments with them.
Software Dependencies No The paper mentions using the 'LLaMA 3 model', 'LoRA (Hu et al., 2021)', and the 'Adam optimizer'. However, it does not specify version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other key software components that would be necessary for reproduction.
Experiment Setup Yes For fine-tuning, we employ Low-Rank Adaptation (LoRA), which efficiently adapts pre-trained language models by injecting trainable low-rank updates into specific model parameters. LoRA Hyperparameters We use the following LoRA configuration for the experiments: LoRA Rank (LORA_R): Set to 16. This rank determines the dimensionality of the low-rank decomposition applied to the weight matrices. Scaling Factor (LORA_ALPHA): Set to 8. This hyperparameter controls the scaling of the low-rank updates during training to ensure stability and effective learning. Dropout (LORA_DROPOUT): Set to 0.05 to introduce regularization and prevent overfitting in the low-rank layers. Training Configuration We fine-tune all attention parameters, specifically the Q, K, and V matrices of the transformer. We use a learning rate of 2e-4 for training. We use the Adam optimizer along with a linear decay learning rate scheduler.
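The prompt layout described in the Dataset Splits row (k = 15 in-context pairs followed by the test input x_{k+1}) can be sketched as follows. This is a minimal illustration: the pair separator and the "->" delimiter are assumptions, since the excerpt does not specify the exact prompt format.

```python
# Hypothetical sketch of the ICL prompt construction: the first k pairs
# become demonstrations, and the (k+1)-th input is left for the model
# to complete. Delimiters here are illustrative, not from the paper.
def build_icl_prompt(pairs, test_input, k=15):
    """pairs: list of (input_seq, output_seq) strings; returns one prompt."""
    demos = "\n".join(f"{x} -> {y}" for x, y in pairs[:k])
    return f"{demos}\n{test_input} ->"

# Toy usage with made-up sequence pairs (16 generated, first 15 used as demos).
pairs = [("a b", "b a"), ("c d", "d c")] * 8
prompt = build_icl_prompt(pairs, "e f")
```

The batch-size choice in the quote (1 for ICA-Tune vs. 16 for standard fine-tuning) then equalizes the number of underlying (x, y) examples seen per step, since each ICA-Tune prompt already packs 16 pairs.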
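The LoRA update implied by the reported hyperparameters (rank 16, scaling factor 8, applied to the Q, K, and V matrices) can be written out numerically. The sketch below uses a toy weight matrix in NumPy; the dimension, initialization scale, and random seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of a single LoRA-adapted weight with the reported
# hyperparameters: rank r = 16, scaling alpha = 8. W stands in for one
# frozen attention matrix (Q, K, or V); only A and B would be trained.
d = 64                      # toy hidden dimension (assumption)
rank, alpha = 16, 8         # LORA_R and LORA_ALPHA from the paper

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))            # frozen pre-trained weight
A = rng.standard_normal((rank, d)) * 0.01  # trainable down-projection
B = np.zeros((d, rank))                    # trainable up-projection, zero-init

# Effective weight during fine-tuning: W + (alpha / r) * B @ A.
W_eff = W + (alpha / rank) * B @ A
assert np.allclose(W_eff, W)  # zero-initialized B => no change at step 0
```

Zero-initializing one of the two low-rank factors is the standard LoRA convention: the adapted model starts exactly at the pre-trained weights, and the (alpha / r) factor scales how strongly the learned update perturbs them.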