Implicit In-context Learning
Authors: Zhuowei Li, Zihao Xu, Ligong Han, Yunhe Gao, Song Wen, Di Liu, Hao Wang, Dimitris Metaxas
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation on nine real-world tasks across three model architectures demonstrates that I2CL achieves few-shot level performance at zero-shot inference cost, and it exhibits robustness against variations in demonstration examples. Furthermore, I2CL facilitates a novel representation of task-ids, enhancing task similarity detection and fostering effective transfer learning. We also perform a comprehensive analysis and ablation study on I2CL, offering deeper insights into its internal mechanisms. |
| Researcher Affiliation | Academia | Department of Computer Science Rutgers University |
| Pseudocode | Yes | Algorithm 1 details the transfer learning method proposed for I2CL. |
| Open Source Code | Yes | Code is available at https://github.com/LzVv123456/I2CL. |
| Open Datasets | Yes | We first take the four tasks used in Wang et al. (2023), including sentiment analysis: SST2 (Socher et al., 2013), emotion classification: EmoC (Chatterjee et al., 2019), question classification: TREC (Voorhees & Tice, 2000), and topic classification: AGNews (Zhang et al., 2015). We then enrich our experiments with five additional datasets, encompassing 5-way sentiment analysis: SST5 (Socher et al., 2013), movie review classification: MR (Pang & Lee, 2005), 14-way topic classification: DBPedia (Zhang et al., 2015), subjectivity status categorization: Subj (Pang & Lee, 2004), and hate speech detection: hate_speech18 (de Gibert et al., 2018). We employ the Hugging Face version of the data (Lhoest et al., 2021) and sample 500 data points from the validation/test set for evaluation. |
| Dataset Splits | Yes | For each task, we randomly sample five demonstration examples per class following the practice described in Wang et al. (2023) to avoid majority label bias (Zhao et al., 2021), and provide a fairly strong few-shot performance. No instruction is further applied to describe the task. Input sequences are formed using simple manually designed templates (included in Appendix A). For evaluation, we report the macro-average accuracy across nine tasks, computed under five random seeds. For the calibration process, we optimize linear coefficients for 100 epochs on the same demonstration set using the AdamW (Loshchilov & Hutter, 2019) optimizer. The learning rate starts at $1\times10^{-2}$ and anneals to $1\times10^{-5}$ according to a cosine scheduler. This calibration profile is applied uniformly across all architectures and tasks without tailoring. |
| Hardware Specification | Yes | Concretely, we initialize $\lambda = 0.1$, $\beta = 1.0$ to promote a modest initial addition of information, and update these coefficients by minimizing the perplexity of label tokens: $-\sum_{(x,y)\in D} \log P(y \mid x, v, c)$ (Eq. 5), where $P(\cdot)$ denotes the induced probability distribution over the entire vocabulary at the end-token position from the last layer. To bolster the robustness and adaptability of the estimated linear coefficients to potential downstream variations, we perturb the residual streams with Gaussian noise $\eta \sim \mathcal{N}(0, I)$ during the calibration phase: $o_l^t = r_{l-1}^t + (\lambda_l^a a_l^e + \beta_l^a a_l^t)$, $o_l^t \leftarrow o_l^t + \gamma \lVert o_l^t \rVert_2\, \eta$, $r_l^t = o_l^t + (\lambda_l^m m_l^e + \beta_l^m m_l^t)$, $r_l^t \leftarrow r_l^t + \gamma \lVert r_l^t \rVert_2\, \eta$, where $\gamma$ is a scalar employed to modulate the intensity of the noise, and $\lVert\cdot\rVert_2$ denotes the L2 norm. The $o$ represents the intermediate state of a residual stream. Given the above formulations, only a few linear coefficients (totaling $4L$) are updated during the calibration phase, rendering this process remarkably efficient (consuming 1-2 minutes on a single A100 40G). |
| Software Dependencies | No | The paper mentions using the 'AdamW (Loshchilov & Hutter, 2019) optimizer' and the 'Hugging Face PEFT library' but does not specify exact version numbers for these or other software components such as Python or PyTorch. |
| Experiment Setup | Yes | For the calibration process, we optimize linear coefficients for 100 epochs on the same demonstration set using the AdamW (Loshchilov & Hutter, 2019) optimizer. The learning rate starts at $1\times10^{-2}$ and anneals to $1\times10^{-5}$ according to a cosine scheduler. This calibration profile is applied uniformly across all architectures and tasks without tailoring. |
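The calibration step quoted above injects context vectors into each layer's residual stream via four learnable scalars and perturbs the result with Gaussian noise scaled by the stream's L2 norm. The following is a minimal NumPy sketch of that per-layer update, not the authors' implementation: all function and variable names (`inject`, `ctx_attn`, `query_attn`, etc.) are hypothetical, and the real method operates on transformer activations rather than standalone vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject(residual, ctx_attn, ctx_mlp, query_attn, query_mlp,
           lam_a, beta_a, lam_m, beta_m, gamma=0.0):
    """Sketch of one layer's calibrated injection (names are illustrative).

    Attention branch: o = r_{l-1} + (lam_a * a^e + beta_a * a^t),
    then o is perturbed by gamma * ||o||_2 * eta, eta ~ N(0, I);
    the MLP branch is added and perturbed the same way.
    """
    o = residual + lam_a * ctx_attn + beta_a * query_attn
    o = o + gamma * np.linalg.norm(o) * rng.standard_normal(o.shape)
    r = o + lam_m * ctx_mlp + beta_m * query_mlp
    r = r + gamma * np.linalg.norm(r) * rng.standard_normal(r.shape)
    return r

# Toy check with the paper's initialization (lambda = 0.1, beta = 1.0)
# and noise disabled (gamma = 0), so the update is deterministic.
d = 8
out = inject(np.zeros(d), np.ones(d), np.ones(d), np.ones(d), np.ones(d),
             lam_a=0.1, beta_a=1.0, lam_m=0.1, beta_m=1.0, gamma=0.0)
print(out)  # each coordinate: 0 + (0.1 + 1.0) + (0.1 + 1.0) = 2.2
```

With L layers, only the four scalars per layer (4L total) would be trained, which is consistent with the paper's report that calibration takes 1-2 minutes on a single A100 40G.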