Rethinking Invariance in In-context Learning

Authors: Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, on most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL. ... We undertake a comprehensive exploration into designing invariant ICL algorithms... Empirically, InvICL indeed achieves superior performance across a range of tasks on both synthetic and real-world datasets. ... 4 EXPERIMENTS
Researcher Affiliation Academia Lizhe Fang (1), Yifei Wang (2), Khashayar Gatmiry (2), Lei Fang (3), Yisen Wang (1,4). Affiliations: (1) State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (2) MIT CSAIL; (3) School of Economics, Peking University; (4) Institute for Artificial Intelligence, Peking University
Pseudocode Yes Algorithm 1 Invariant In-context Learning
Require: $\{(h_{x_i}^{(0)}, h_{y_i}^{(0)})\}_{i=1}^{n}$: embeddings of the context examples; $h_{x_t}^{(0)}$: embedding of the ICL query
1: for $k = 1$ to #Transformer layers do
2:   for $i = 1$ to $n$ do
3:     Compute the independent encoding of each context example: $(\bar{h}_{x_i}^{(k)}, \bar{h}_{y_i}^{(k)}) = \mathrm{aggr}\{(\bar{h}_{x_i}^{(k-1)}, \bar{h}_{y_i}^{(k-1)})\}$ (where $\bar{h}_{x_i}^{(0)} = h_{x_i}^{(0)}$)
4:   end for
5:   for $i = 1$ to $n$ do
6:     Compute the leave-one-out pre-encoding of the $i$-th context example: $(h_{x_i}^{(k)}, h_{y_i}^{(k)}) = \mathrm{aggr}\{\{(\bar{h}_{x_j}^{(k-1)}, \bar{h}_{y_j}^{(k-1)})\}_{j \neq i}, h_{x_i}^{(k-1)}\}$
7:   end for
8:   Update $h_{x_t}^{(k)} = \mathrm{aggr}\{\{(h_{x_i}^{(k-1)}, h_{y_i}^{(k-1)})\}_{i=1}^{n}\}$
9: end for
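A minimal sketch of one layer of the Algorithm 1 update. All names are illustrative, not the paper's implementation, and `aggr` is stood in for by mean pooling as a toy permutation-invariant aggregation:

```python
# Sketch of one InvICL layer (Algorithm 1), with mean pooling as a toy
# stand-in for the attention-style aggregation `aggr`. Hypothetical names.
import numpy as np

def aggr(vectors):
    """Toy permutation-invariant aggregation: mean pooling over a list."""
    return np.mean(vectors, axis=0)

def inv_icl_layer(h_bar, h, h_query):
    """One layer update.
    h_bar:   previous-layer independent encodings of the n context examples
    h:       previous-layer leave-one-out encodings of the n context examples
    h_query: previous-layer encoding of the ICL query
    """
    n = len(h_bar)
    # Step 3: independent encoding (each example attends only to itself)
    new_h_bar = [aggr([h_bar[i]]) for i in range(n)]
    # Step 6: leave-one-out pre-encoding of example i over the other examples
    new_h = [aggr([h_bar[j] for j in range(n) if j != i] + [h[i]])
             for i in range(n)]
    # Step 8: the query aggregates over the leave-one-out encodings
    # (the query's own embedding is included here as an assumption)
    new_h_query = aggr(h + [h_query])
    return new_h_bar, new_h, new_h_query
```

Because every context example is encoded symmetrically and the query pools over all of them, permuting the context examples leaves the query update unchanged, which is the invariance the algorithm is built around.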
Open Source Code Yes Code is available at https://github.com/PKU-ML/InvICL.
Open Datasets Yes In this part, we conduct experiments to evaluate the capacity of InvICL on real-world datasets. Since ICL tasks are generally different from the pretraining one, and some ICL methods introduce new masking schemes for aggregation (significantly different from the masking in the pretrained model), a short finetuning of the pretrained model on the ICL tasks using these new ICL methods is necessary to fully utilize the pretrained model's capacity for ICL (Min et al., 2022b; Wei et al., 2021; Iyer et al., 2022; Cai et al., 2023). Here, we follow MetaICL (Min et al., 2022b) for the short finetuning and evaluation. As in MetaICL, we utilize 142 tasks including text classification, question answering (QA), natural language inference (NLI), and paraphrase detection.
Dataset Splits No The models are trained with a sequence length of 40. As shown in Figure 3(b), when the test sequence length exceeds 40, it is clear that InvICL > AR ICL > Prefix ICL (no PE) > BoE ICL in terms of performance. This indicates the strong length-generalization capability of InvICL. On one hand, this result confirms the conventional conclusion that a model that respects the data symmetry enjoys better generalization capability. ... For each training iteration, we first sample a task $T_i$ from the $C$ meta-training tasks, and then sample $k+1$ training examples $(x_1, y_1), \ldots, (x_{k+1}, y_{k+1})$ from $T_i$. ... The paper describes how examples are sampled for training and how models are evaluated, but does not provide explicit train/test/validation splits for a fixed dataset.
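The meta-training sampling step quoted above (draw a task, then draw k+1 examples from it, with the last one held out as the query) can be sketched as follows; the task generators here are toy stand-ins, not the paper's datasets:

```python
# Sketch of the meta-training sampling step: pick one task T_i from the
# C meta-training tasks, then draw k+1 examples from it. The last example
# serves as the query; the first k are the context. Toy task generators.
import random

def sample_batch(tasks, k):
    """tasks: list of callables, each returning one (x, y) example."""
    task = random.choice(tasks)            # sample T_i uniformly
    examples = [task() for _ in range(k + 1)]
    context, (x_query, y_query) = examples[:k], examples[k]
    return context, x_query, y_query

# Hypothetical constant tasks, just to exercise the sampler
tasks = [lambda: (1, 2), lambda: (3, 4)]
ctx, xq, yq = sample_batch(tasks, k=8)     # 8 context examples, as quoted
```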
Hardware Specification No We find that when the input size of the GPT-2 Large model increases from 512 to 1024, the GPU memory overhead increases by 14% (from 4.2 GB to 4.8 GB). We consider this acceptable given the clear improvements in performance.
Software Dependencies No The architecture selection follows (Garg et al., 2022), where a 12-layer GPT-like Transformer decoder is utilized.
Experiment Setup Yes For the linear regression task, for example, we train a model to perform linear regression using in-context learning... Detailed experimental settings are provided in Appendix A.3. ... [Figure panels: (a) 50k Epochs; (b) 200k Epochs.] ... In Appendix A.3, we set $d = 20$, $k = 40$, $D_X = \mathcal{N}(0, I_d)$, and $D_G: w \sim \mathcal{N}(0, I_d)$, $b = 0$. ... The training objective is $\min_\theta \mathbb{E}_{D_G, D_X} \sum_{i=0}^{k} \ell(\hat{g}(x_i), g(x_i))$, where $\ell$ is the MSE loss. ... For each training iteration, we first sample a task $T_i$ from the $C$ meta-training tasks, and then sample $k+1$ training examples $(x_1, y_1), \ldots, (x_{k+1}, y_{k+1})$ from $T_i$. Given the model parameters $\theta$, the training objective is maximizing the prediction accuracy of $y_{k+1}$ under the formatting of ICL, i.e., minimizing $\mathcal{L}_{\mathrm{CE}}(\hat{y}_{k+1}, y_{k+1})$, where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss and $\hat{y}_{k+1}$ is the in-context prediction defined in Eq. (1). We adopt 8 context examples for training and evaluation.
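A minimal sketch of the linear-regression ICL objective quoted above, using the stated settings (d = 20, k = 40, Gaussian data and tasks). A closed-form least-squares predictor on each context prefix stands in for the 12-layer GPT-like Transformer, purely for illustration:

```python
# Sketch of one training iteration of the linear-regression ICL task.
# d = 20, k = 40, D_X = N(0, I_d), D_G: w ~ N(0, I_d), b = 0, as quoted.
# `prefix_predict` is a least-squares stand-in for the Transformer.
import numpy as np

d, k = 20, 40
rng = np.random.default_rng(0)

def sample_task():
    w = rng.normal(size=d)             # D_G: w ~ N(0, I_d), b = 0
    xs = rng.normal(size=(k + 1, d))   # x_i ~ D_X = N(0, I_d)
    ys = xs @ w                        # g(x) = w^T x (noiseless)
    return xs, ys

def prefix_predict(xs, ys, i):
    """Predict y_i from the first i context pairs (stand-in model)."""
    if i == 0:
        return 0.0                     # no context yet: predict zero
    w_hat, *_ = np.linalg.lstsq(xs[:i], ys[:i], rcond=None)
    return float(xs[i] @ w_hat)

xs, ys = sample_task()
# Mirrors the objective: sum over i = 0..k of the squared error
# between the prefix prediction g_hat(x_i) and the target g(x_i)
loss = sum((prefix_predict(xs, ys, i) - ys[i]) ** 2 for i in range(k + 1))
```

Since the data are noiseless and k = 40 > d = 20, the prefix predictions become exact once the context determines the task, so the loss is dominated by the early, underdetermined prefixes; the trained Transformer is meant to approach this in-context behavior.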