Rethinking Invariance in In-context Learning

Authors: Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, on most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL. ... We undertake a comprehensive exploration into designing invariant ICL algorithms... Empirically, InvICL indeed achieves superior performance across a range of tasks on both synthetic and real-world datasets. ... 4 EXPERIMENTS
Researcher Affiliation Academia Lizhe Fang (1), Yifei Wang (2), Khashayar Gatmiry (2), Lei Fang (3), Yisen Wang (1,4). Affiliations: (1) State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (2) MIT CSAIL; (3) School of Economics, Peking University; (4) Institute for Artificial Intelligence, Peking University
Pseudocode Yes Algorithm 1 Invariant In-context Learning
Require: $\{(h_{x_i}^{(0)}, h_{y_i}^{(0)})\}_{i=1}^{n}$: embeddings of the context examples; $h_{x_t}^{(0)}$: embedding of the ICL query
1: for $k = 1$ to #Transformer layers do
2:   for $i = 1$ to $n$ do
3:     Compute the independent encoding of each context example: $(\bar{h}_{x_i}^{(k)}, \bar{h}_{y_i}^{(k)}) = \mathrm{aggr}\{(\bar{h}_{x_i}^{(k-1)}, \bar{h}_{y_i}^{(k-1)})\}$ (where $\bar{h}_{x_i}^{(0)} = h_{x_i}^{(0)}$)
4:   end for
5:   for $i = 1$ to $n$ do
6:     Compute the leave-one-out pre-encoding of the $i$-th context example: $(h_{x_i}^{(k)}, h_{y_i}^{(k)}) = \mathrm{aggr}\{\{(\bar{h}_{x_j}^{(k-1)}, \bar{h}_{y_j}^{(k-1)})\}_{j \neq i}, h_{x_i}^{(k-1)}\}$
7:   end for
8:   Update $h_{x_t}^{(k)} = \mathrm{aggr}\{\{(h_{x_i}^{(k-1)}, h_{y_i}^{(k-1)})\}_{i=1}^{n}\}$
9: end for
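A minimal sketch of one layer of the Algorithm 1 update. All names are illustrative, not the paper's implementation, and `aggr` is stood in for by mean pooling as a toy permutation-invariant aggregation:

```python
# Sketch of one InvICL layer (Algorithm 1), with mean pooling as a toy
# stand-in for the attention-style aggregation `aggr`. Hypothetical names.
import numpy as np

def aggr(vectors):
    """Toy permutation-invariant aggregation: mean pooling over a list."""
    return np.mean(vectors, axis=0)

def inv_icl_layer(h_bar, h, h_query):
    """One layer update.
    h_bar:   previous-layer independent encodings of the n context examples
    h:       previous-layer leave-one-out encodings of the n context examples
    h_query: previous-layer encoding of the ICL query
    """
    n = len(h_bar)
    # Step 3: independent encoding (each example attends only to itself)
    new_h_bar = [aggr([h_bar[i]]) for i in range(n)]
    # Step 6: leave-one-out pre-encoding of example i over the other examples
    new_h = [aggr([h_bar[j] for j in range(n) if j != i] + [h[i]])
             for i in range(n)]
    # Step 8: the query aggregates over the leave-one-out encodings
    # (the query's own embedding is included here as an assumption)
    new_h_query = aggr(h + [h_query])
    return new_h_bar, new_h, new_h_query
```

Because every context example is encoded symmetrically and the query pools over all of them, permuting the context examples leaves the query update unchanged, which is the invariance the algorithm is built around.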
Open Source Code Yes Code is available at https://github.com/PKU-ML/InvICL.
Open Datasets Yes In this part, we conduct experiments to evaluate the capacity of InvICL on real-world datasets. Since ICL tasks are generally different from the pretraining one, and some ICL methods introduce new masking schemes for aggregation (significantly different from the masking in the pretrained model), a short finetuning of the pretrained model on the ICL tasks using these new ICL methods is necessary to fully utilize the pretrained model's capacity for ICL (Min et al., 2022b; Wei et al., 2021; Iyer et al., 2022; Cai et al., 2023). Here, we follow MetaICL (Min et al., 2022b) for the short finetuning and evaluation. As in MetaICL, we utilize 142 tasks including text classification, question answering (QA), natural language inference (NLI), and paraphrase detection.
Dataset Splits No The models are trained with a sequence length of 40. As shown in Figure 3(b), when the test sequence length exceeds 40, it is clear that InvICL > AR ICL > Prefix ICL (no PE) > BoE ICL in terms of performance. This indicates the strong length-generalization capability of InvICL. On one hand, this result confirms the conventional conclusion that a model that respects the data symmetry enjoys better generalization capability. ... For each training iteration, we first sample a task $T_i$ from the $C$ meta-training tasks, and then sample $k+1$ training examples $(x_1, y_1), \ldots, (x_{k+1}, y_{k+1})$ from $T_i$. ... The paper describes how examples are sampled for training and how models are evaluated, but does not provide explicit train/test/validation splits for a fixed dataset.
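The meta-training sampling step quoted above (draw a task, then draw k+1 examples from it, with the last one held out as the query) can be sketched as follows; the task generators here are toy stand-ins, not the paper's datasets:

```python
# Sketch of the meta-training sampling step: pick one task T_i from the
# C meta-training tasks, then draw k+1 examples from it. The last example
# serves as the query; the first k are the context. Toy task generators.
import random

def sample_batch(tasks, k):
    """tasks: list of callables, each returning one (x, y) example."""
    task = random.choice(tasks)            # sample T_i uniformly
    examples = [task() for _ in range(k + 1)]
    context, (x_query, y_query) = examples[:k], examples[k]
    return context, x_query, y_query

# Hypothetical constant tasks, just to exercise the sampler
tasks = [lambda: (1, 2), lambda: (3, 4)]
ctx, xq, yq = sample_batch(tasks, k=8)     # 8 context examples, as quoted
```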
Hardware Specification No We find that when the input size of the GPT-2 Large model increases from 512 to 1024, the GPU memory overhead increases by 14% (from 4.2 GB to 4.8 GB). We consider this acceptable given the clear improvements in performance.
Software Dependencies No The architecture selection follows (Garg et al., 2022), where a 12-layer GPT-like Transformer decoder is utilized.
Experiment Setup Yes For the linear regression task, for example, we train a model to perform linear regression using in-context learning... Detailed experimental settings are provided in Appendix A.3. ... [Figure panels: (a) 50k Epochs; (b) 200k Epochs.] ... In Appendix A.3, we set $d = 20$, $k = 40$, $D_X = \mathcal{N}(0, I_d)$, and $D_G: w \sim \mathcal{N}(0, I_d)$, $b = 0$. ... The training objective is $\min_\theta \mathbb{E}_{D_G, D_X} \sum_{i=0}^{k} \ell(\hat{g}(x_i), g(x_i))$, where $\ell$ is the MSE loss. ... For each training iteration, we first sample a task $T_i$ from the $C$ meta-training tasks, and then sample $k+1$ training examples $(x_1, y_1), \ldots, (x_{k+1}, y_{k+1})$ from $T_i$. Given the model parameters $\theta$, the training objective is maximizing the prediction accuracy of $y_{k+1}$ under the formatting of ICL, i.e., minimizing $\mathcal{L}_{\mathrm{CE}}(\hat{y}_{k+1}, y_{k+1})$, where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss and $\hat{y}_{k+1}$ is the in-context prediction defined in Eq. (1). We adopt 8 context examples for training and evaluation.
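A minimal sketch of the linear-regression ICL objective quoted above, using the stated settings (d = 20, k = 40, Gaussian data and tasks). A closed-form least-squares predictor on each context prefix stands in for the 12-layer GPT-like Transformer, purely for illustration:

```python
# Sketch of one training iteration of the linear-regression ICL task.
# d = 20, k = 40, D_X = N(0, I_d), D_G: w ~ N(0, I_d), b = 0, as quoted.
# `prefix_predict` is a least-squares stand-in for the Transformer.
import numpy as np

d, k = 20, 40
rng = np.random.default_rng(0)

def sample_task():
    w = rng.normal(size=d)             # D_G: w ~ N(0, I_d), b = 0
    xs = rng.normal(size=(k + 1, d))   # x_i ~ D_X = N(0, I_d)
    ys = xs @ w                        # g(x) = w^T x (noiseless)
    return xs, ys

def prefix_predict(xs, ys, i):
    """Predict y_i from the first i context pairs (stand-in model)."""
    if i == 0:
        return 0.0                     # no context yet: predict zero
    w_hat, *_ = np.linalg.lstsq(xs[:i], ys[:i], rcond=None)
    return float(xs[i] @ w_hat)

xs, ys = sample_task()
# Mirrors the objective: sum over i = 0..k of the squared error
# between the prefix prediction g_hat(x_i) and the target g(x_i)
loss = sum((prefix_predict(xs, ys, i) - ys[i]) ** 2 for i in range(k + 1))
```

Since the data are noiseless and k = 40 > d = 20, the prefix predictions become exact once the context determines the task, so the loss is dominated by the early, underdetermined prefixes; the trained Transformer is meant to approach this in-context behavior.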