Revisiting In-context Learning Inference Circuit in Large Language Models

Authors: Hakaze Cho, Mariko Kato, Yoshihiro Sakai, Naoya Inoue

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically find evidence for the existence of each proposed step in LLMs, and conduct more fine-grained measurements to gain insights into some phenomena observed in ICL scenarios, such as (1) positional bias: the prediction is more influenced by the latter demonstrations (Zhao et al., 2021), (2) noise robustness: the prediction is not easily affected by demonstrations with wrong (noisy) labels (Min et al., 2022), although larger models are less robust to label noise (Wei et al., 2023), and (3) demonstration saturation: the accuracy improvements plateau when sufficient demonstrations are given (Agarwal et al., 2024; Bertsch et al., 2024), etc. (discussed in §5.3).
Researcher Affiliation | Academia | 1 Japan Advanced Institute of Science and Technology, 2 RIKEN. Primary Contributor, Correspondence to: EMAIL
Pseudocode | No | The paper describes methods through narrative text and diagrams (e.g., Figure 1), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The official code implementation of this paper by the authors can be found at https://github.com/hc495/ICL_Circuit. Please follow the instructions in this GitHub repository to reproduce the experiments.
Open Datasets | Yes | Datasets. We build ICL-formed test inputs from 6 real-world sentence classification datasets, and unless specified, we report the average results on them: SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), Financial Phrasebank (Malo et al., 2014), SST-5 (Socher et al., 2013), TREC (Li & Roth, 2002; Hovy et al., 2001), and AGNews (Zhang et al., 2015).
Dataset Splits | Yes | For each dataset, we randomly sample 512 test data points and assign one fixed demonstration sequence for each test sample to form a test input. In our experiments, we set the training sample number m = 256 and the similarity function s(a, b) = ‖a − b‖₂ (the ℓ2 distance).
Hardware Specification | No | The paper mentions using pre-trained LLMs such as Llama 3 (8B, 70B) and Falcon (7B, 40B), and applying INT4 quantization, but does not specify the actual hardware (e.g., GPU models, CPU types) on which these experiments were run.
Software Dependencies | No | In our experiments, we use BitsAndBytes (https://huggingface.co/docs/bitsandbytes/main/en/index) to quantize Llama 3 70B and Falcon 40B to INT4. For the other models, full-precision inference is conducted. While BitsAndBytes is mentioned, a specific version number is not provided, and other key software dependencies (e.g., Python, PyTorch) and their versions are not listed.
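The INT4 setup mentioned above could be reproduced along the following lines. This is a configuration sketch, not the authors' actual loading code (which the paper does not show); the model identifier and the use of `BitsAndBytesConfig` via Hugging Face `transformers` are assumptions.

```python
# Sketch (assumption): loading a large model in 4-bit via transformers + bitsandbytes.
# The paper states only that Llama 3 70B and Falcon 40B were quantized to INT4;
# the exact library versions and loading code are not specified.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # INT4 weights

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # hypothetical checkpoint name
    quantization_config=quant_config,
    device_map="auto",  # shard across available GPUs
)
```

Smaller models (Llama 3 8B, Falcon 7B) would instead be loaded without `quantization_config`, matching the paper's statement that full-precision inference is used for them.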
Experiment Setup | Yes | Unless specified, we use k = 4 demonstrations in ICL inputs. For each dataset, we randomly sample 512 test data points and assign one fixed demonstration sequence for each test sample to form a test input. About the prompt templates, etc., please refer to Appendix A.1. In our experiments, we set the training sample number m = 256 and the similarity function s(a, b) = ‖a − b‖₂ (the ℓ2 distance).
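The input construction described in this row (a fixed sequence of k = 4 labeled demonstrations followed by one unlabeled query) can be sketched as follows. The template string and example texts are illustrative placeholders, not the paper's actual templates (those are in Appendix A.1).

```python
import random

def build_icl_prompt(demos, query, template="{text}\nLabel: {label}\n"):
    """Concatenate k labeled demonstrations, then the unlabeled query."""
    parts = [template.format(text=t, label=l) for t, l in demos]
    parts.append(f"{query}\nLabel:")  # model predicts the token after "Label:"
    return "\n".join(parts)

# One fixed demonstration sequence is sampled per test point (paper: k = 4).
random.seed(0)
train_pool = [("great movie", "positive"), ("terrible plot", "negative"),
              ("loved it", "positive"), ("boring film", "negative"),
              ("a masterpiece", "positive")]
demos = random.sample(train_pool, k=4)
prompt = build_icl_prompt(demos, "what a waste of time")
```

Repeating this over 512 sampled test points per dataset, each paired with its own fixed demonstration sequence, yields the test inputs described above.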