Revisiting In-context Learning Inference Circuit in Large Language Models

Authors: Hakaze Cho, Mariko Kato, Yoshihiro Sakai, Naoya Inoue

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically find evidence for the existence of each proposed step in LLMs, and conduct more fine-grained measurements to gain insights into some phenomena observed in ICL scenarios, such as (1) positional bias: the prediction is more influenced by the latter demonstrations (Zhao et al., 2021), (2) noise robustness: the prediction is not easily affected by demonstrations with wrong (noisy) labels (Min et al., 2022), although larger models are less robust to label noise (Wei et al., 2023), and (3) demonstration saturation: the accuracy improvements plateau when sufficient demonstrations are given (Agarwal et al., 2024; Bertsch et al., 2024), etc. (discussed in §5.3).
Researcher Affiliation | Academia | 1 Japan Advanced Institute of Science and Technology, 2 RIKEN. Primary Contributor, Correspondence to: EMAIL
Pseudocode | No | The paper describes methods through narrative text and diagrams (e.g., Figure 1), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The official code implementation of this paper by the authors can be found at https://github.com/hc495/ICL_Circuit. Please follow the instructions in this GitHub repository to reproduce the experiments.
Open Datasets | Yes | Datasets. We build ICL-formed test inputs from 6 real-world sentence classification datasets, and unless specified, we report the average results on them: SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), Financial Phrasebank (Malo et al., 2014), SST-5 (Socher et al., 2013), TREC (Li & Roth, 2002; Hovy et al., 2001), and AGNews (Zhang et al., 2015).
Dataset Splits | Yes | For each dataset, we randomly sample 512 test data points and assign one fixed demonstration sequence for each test sample to form a test input. In our experiments, we set the training sample number m = 256 and the similarity function s(a, b) = ‖a − b‖₂ (the ℓ2 distance).
Hardware Specification | No | The paper mentions using pre-trained LLMs such as Llama 3 (8B, 70B) and Falcon (7B, 40B), and applying INT4 quantization, but does not specify the actual hardware (e.g., GPU models, CPU types) on which these experiments were run.
Software Dependencies | No | In our experiments, we use BitsAndBytes (https://huggingface.co/docs/bitsandbytes/main/en/index) to quantize Llama 3 70B and Falcon 40B to INT4. For the other models, full-precision inference is conducted. While BitsAndBytes is mentioned, a specific version number is not provided, and other key software dependencies (e.g., Python, PyTorch) and their versions are not listed.
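The INT4 setup mentioned above could be reproduced along the following lines. This is a configuration sketch, not the authors' actual loading code (which the paper does not show); the model identifier and the use of `BitsAndBytesConfig` via Hugging Face `transformers` are assumptions.

```python
# Sketch (assumption): loading a large model in 4-bit via transformers + bitsandbytes.
# The paper states only that Llama 3 70B and Falcon 40B were quantized to INT4;
# the exact library versions and loading code are not specified.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # INT4 weights

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # hypothetical checkpoint name
    quantization_config=quant_config,
    device_map="auto",  # shard across available GPUs
)
```

Smaller models (Llama 3 8B, Falcon 7B) would instead be loaded without `quantization_config`, matching the paper's statement that full-precision inference is used for them.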
Experiment Setup | Yes | Unless specified, we use k = 4 demonstrations in ICL inputs. For each dataset, we randomly sample 512 test data points and assign one fixed demonstration sequence for each test sample to form a test input. About the prompt templates, etc., please refer to Appendix A.1. In our experiments, we set the training sample number m = 256 and the similarity function s(a, b) = ‖a − b‖₂ (the ℓ2 distance).
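The input construction described in this row (a fixed sequence of k = 4 labeled demonstrations followed by one unlabeled query) can be sketched as follows. The template string and example texts are illustrative placeholders, not the paper's actual templates (those are in Appendix A.1).

```python
import random

def build_icl_prompt(demos, query, template="{text}\nLabel: {label}\n"):
    """Concatenate k labeled demonstrations, then the unlabeled query."""
    parts = [template.format(text=t, label=l) for t, l in demos]
    parts.append(f"{query}\nLabel:")  # model predicts the token after "Label:"
    return "\n".join(parts)

# One fixed demonstration sequence is sampled per test point (paper: k = 4).
random.seed(0)
train_pool = [("great movie", "positive"), ("terrible plot", "negative"),
              ("loved it", "positive"), ("boring film", "negative"),
              ("a masterpiece", "positive")]
demos = random.sample(train_pool, k=4)
prompt = build_icl_prompt(demos, "what a waste of time")
```

Repeating this over 512 sampled test points per dataset, each paired with its own fixed demonstration sequence, yields the test inputs described above.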