In-Context Learning as Conditioned Associative Memory Retrieval
Authors: Weimin Wu, Teng-Yun Hsiao, Jerry Yao-Chieh Hu, Wenxin Zhang, Han Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also give explanations for three key behaviors of ICL and validate them through experiments. ... Our experiments cover both synthetic and real-world tasks: (i) Following (Garg et al., 2022), we use the GPT-2 model on synthetic problems, including linear regression, decision trees, and 2-layer neural networks; (ii) We use the GPT-J model on a sentiment classification task. |
| Researcher Affiliation | Academia | 1Center for Foundation Models and Generative AI, Northwestern University, USA 2Department of Computer Science, Northwestern University, USA 3Department of Physics, National Taiwan University, Taiwan 4Department of Statistics and Data Science, Northwestern University, USA. |
| Pseudocode | No | No clearly labeled pseudocode or algorithm blocks are present in the paper. The methodology is described in paragraph form and mathematical formulations. |
| Open Source Code | No | No explicit statement about releasing code or a link to a code repository is provided in the paper. |
| Open Datasets | Yes | Our experiments cover both synthetic and real-world tasks: ... (ii) We use the GPT-J model on a sentiment classification task... We use the sentiment classification task with the Tweet Eval: Hate Speech Detection dataset (Basile et al., 2019)... and (ii) out-of-distribution (OOD) in-context examples from the CC-News corpus (Nagel, 2016) |
| Dataset Splits | Yes | To construct one sample in a batch, we use the following steps: (i) Sample the linear regression coefficient β_i ∈ R^20 from N(0, I). (ii) Generate queries x_{i,j} from the Gaussian mixture model ω_1·N(−2, I) + ω_2·N(2, I), where ω_1 = 1, ω_2 = 0 in pre-training; this yields {x_{i,j}}_{j=1}^{k} with k = 50. (iii) For each query x_{i,j}, use y_{i,j} = β_i^T x_{i,j} to calculate the true response... The pre-training process iterates for 500k steps. ... During testing, we generate samples similarly to the pre-training process. The batch size is 64 and the number of batches is 100, i.e., 6,400 samples in total. For each in-context length j ∈ [75], we calculate the R-squared between the estimate and the true value over all 6,400 samples. |
| Hardware Specification | Yes | We implement experiments on 1 NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using 'GPT-2 model' and 'GPT-J model' but does not specify versions for these models or any other software libraries or frameworks used (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Following the pre-training method in (Garg et al., 2022), we use a batch size of 64. ... The pre-training process iterates for 500k steps. ... we define the target function as f(x) = β^T x, β ∈ R^d, where d = 20. The distribution of x ∈ R^d is the Gaussian mixture model ω_1·N(−2, I) + ω_2·N(2, I), where ω_1 = 1, ω_2 = 0 in pre-training. ... For the decision tree, we consider the function f as a decision tree with 20-dimensional inputs and a depth of 4. ... For the 2-layer neural network, we consider ReLU neural networks. We set each function f as f(x) = Σ_{i=1}^{r} α_i·σ(w_i^T x), where α_i ∈ R, w_i ∈ R^d, and σ(·) = max(0, ·) is the ReLU activation function. We draw the network parameters α_i and w_i from N(0, 2/r) and N(0, I_d), respectively. We use r = 100 hidden nodes. |
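
The synthetic data-generation recipe quoted in the Dataset Splits and Experiment Setup rows (linear-regression targets y = β^T x and 2-layer ReLU network targets, with queries drawn from the mixture ω_1·N(−2, I) + ω_2·N(2, I)) can be sketched as follows. This is a minimal NumPy illustration of the described sampling steps, not the authors' released code (none is reported); the function names and RNG seed are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20          # input dimension
k = 50          # queries per sample (in-context length)
r = 100         # hidden nodes in the 2-layer ReLU network
omega1, omega2 = 1.0, 0.0   # mixture weights used during pre-training

def sample_queries(n):
    """Draw n queries from the mixture ω1·N(−2, I) + ω2·N(2, I) over R^d."""
    # Pick a mixture component mean (−2 or +2) per query, then add N(0, I) noise.
    means = rng.choice([-2.0, 2.0], size=n, p=[omega1, omega2])
    return rng.normal(loc=means[:, None], scale=1.0, size=(n, d))

def linear_regression_sample():
    """One batch element: β ~ N(0, I) in R^20, responses y_j = β^T x_j."""
    beta = rng.normal(size=d)
    X = sample_queries(k)
    y = X @ beta
    return X, y

def relu_network_sample():
    """Target f(x) = Σ_{i=1}^r α_i σ(w_i^T x), α_i ~ N(0, 2/r), w_i ~ N(0, I_d)."""
    alpha = rng.normal(scale=np.sqrt(2.0 / r), size=r)
    W = rng.normal(size=(r, d))
    X = sample_queries(k)
    y = np.maximum(X @ W.T, 0.0) @ alpha   # ReLU hidden layer, then linear readout
    return X, y
```

With ω_1 = 1 and ω_2 = 0 as in pre-training, every query mean is −2; the OOD test regime described in the paper would flip these weights toward the second component.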