Implicit In-context Learning
Authors: Zhuowei Li, Zihao Xu, Ligong Han, Yunhe Gao, Song Wen, Di Liu, Hao Wang, Dimitris Metaxas
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation on nine real-world tasks across three model architectures demonstrates that I2CL achieves few-shot level performance at zero-shot inference cost, and it exhibits robustness against variations in demonstration examples. Furthermore, I2CL facilitates a novel representation of task-ids, enhancing task similarity detection and fostering effective transfer learning. We also perform a comprehensive analysis and ablation study on I2CL, offering deeper insights into its internal mechanisms. |
| Researcher Affiliation | Academia | Department of Computer Science Rutgers University |
| Pseudocode | Yes | Algorithm 1 details the transfer learning method proposed for I2CL. |
| Open Source Code | Yes | Code is available at https://github.com/LzVv123456/I2CL. |
| Open Datasets | Yes | We first take the four tasks used in Wang et al. (2023), including sentiment analysis: SST2 (Socher et al., 2013), emotion classification: EmoC (Chatterjee et al., 2019), question classification: TREC (Voorhees & Tice, 2000), and topic classification: AGNews (Zhang et al., 2015). We then enrich our experiments with five additional datasets, encompassing 5-way sentiment analysis: SST5 (Socher et al., 2013), movie review classification: MR (Pang & Lee, 2005), 14-way topic classification: DBPedia (Zhang et al., 2015), subjectivity status categorization: Subj (Pang & Lee, 2004), and hate speech detection: hate_speech18 (de Gibert et al., 2018). We employ the Hugging Face version of the data (Lhoest et al., 2021) and sample 500 data points from the validation/test set for evaluation. |
| Dataset Splits | Yes | For each task, we randomly sample five demonstration examples per class following the practice described in Wang et al. (2023) to avoid majority label bias (Zhao et al., 2021), and provide a fairly strong few-shot performance. No instruction is further applied to describe the task. Input sequences are formed using simple manually designed templates (included in Appendix A). For evaluation, we report the macro-average accuracy across nine tasks, computed under five random seeds. For the calibration process, we optimize linear coefficients for 100 epochs on the same demonstration set using the AdamW (Loshchilov & Hutter, 2019) optimizer. The learning rate starts at $1\times10^{-2}$ and anneals to $1\times10^{-5}$ according to a cosine scheduler. This calibration profile is applied uniformly across all architectures and tasks without tailoring. |
| Hardware Specification | Yes | Concretely, we initialize $\lambda = 0.1$, $\beta = 1.0$ to promote a modest initial addition of information, and update these coefficients by minimizing the perplexity of label tokens: $-\sum_{(x,y)\in D} \log P(y \mid x, v, c)$ (Eq. 5), where $P(\cdot)$ denotes the induced probability distribution over the entire vocabulary at the end-token position from the last layer. To bolster the robustness and adaptability of the estimated linear coefficients to potential downstream variations, we perturb the residual streams with Gaussian noise $\eta \sim \mathcal{N}(0, I)$ during the calibration phase: $o_l^t = r_{l-1}^t + (\lambda_l^a a_l^e + \beta_l^a a_l^t)$, $o_l^t \leftarrow o_l^t + \gamma \lVert o_l^t \rVert_2\, \eta$, $r_l^t = o_l^t + (\lambda_l^m m_l^e + \beta_l^m m_l^t)$, $r_l^t \leftarrow r_l^t + \gamma \lVert r_l^t \rVert_2\, \eta$, where $\gamma$ is a scalar employed to modulate the intensity of the noise, and $\lVert\cdot\rVert_2$ denotes the L2 norm. The $o$ represents the intermediate state of a residual stream. Given the above formulations, only a few linear coefficients (totaling $4L$) are updated during the calibration phase, rendering this process remarkably efficient (consuming 1-2 minutes on a single A100 40G). |
| Software Dependencies | No | The paper mentions using the 'AdamW (Loshchilov & Hutter, 2019) optimizer' and the 'Hugging Face PEFT library' but does not specify exact version numbers for these or other software components such as Python or PyTorch. |
| Experiment Setup | Yes | For the calibration process, we optimize linear coefficients for 100 epochs on the same demonstration set using the AdamW (Loshchilov & Hutter, 2019) optimizer. The learning rate starts at $1\times10^{-2}$ and anneals to $1\times10^{-5}$ according to a cosine scheduler. This calibration profile is applied uniformly across all architectures and tasks without tailoring. |
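The calibration step quoted above injects context vectors into each layer's residual stream via four learnable scalars and perturbs the result with Gaussian noise scaled by the stream's L2 norm. The following is a minimal NumPy sketch of that per-layer update, not the authors' implementation: all function and variable names (`inject`, `ctx_attn`, `query_attn`, etc.) are hypothetical, and the real method operates on transformer activations rather than standalone vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject(residual, ctx_attn, ctx_mlp, query_attn, query_mlp,
           lam_a, beta_a, lam_m, beta_m, gamma=0.0):
    """Sketch of one layer's calibrated injection (names are illustrative).

    Attention branch: o = r_{l-1} + (lam_a * a^e + beta_a * a^t),
    then o is perturbed by gamma * ||o||_2 * eta, eta ~ N(0, I);
    the MLP branch is added and perturbed the same way.
    """
    o = residual + lam_a * ctx_attn + beta_a * query_attn
    o = o + gamma * np.linalg.norm(o) * rng.standard_normal(o.shape)
    r = o + lam_m * ctx_mlp + beta_m * query_mlp
    r = r + gamma * np.linalg.norm(r) * rng.standard_normal(r.shape)
    return r

# Toy check with the paper's initialization (lambda = 0.1, beta = 1.0)
# and noise disabled (gamma = 0), so the update is deterministic.
d = 8
out = inject(np.zeros(d), np.ones(d), np.ones(d), np.ones(d), np.ones(d),
             lam_a=0.1, beta_a=1.0, lam_m=0.1, beta_m=1.0, gamma=0.0)
print(out)  # each coordinate: 0 + (0.1 + 1.0) + (0.1 + 1.0) = 2.2
```

With L layers, only the four scalars per layer (4L total) would be trained, which is consistent with the paper's report that calibration takes 1-2 minutes on a single A100 40G.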