In-Context Learning as Conditioned Associative Memory Retrieval
Authors: Weimin Wu, Teng-Yun Hsiao, Jerry Yao-Chieh Hu, Wenxin Zhang, Han Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also give explanations for three key behaviors of ICL and validate them through experiments. ... Our experiments cover both synthetic and real-world tasks: (i) Following (Garg et al., 2022), we use the GPT-2 model on synthetic problems, including linear regression, decision trees, and 2-layer neural networks; (ii) We use the GPT-J model on a sentiment classification task. |
| Researcher Affiliation | Academia | 1Center for Foundation Models and Generative AI, Northwestern University, USA 2Department of Computer Science, Northwestern University, USA 3Department of Physics, National Taiwan University, Taiwan 4Department of Statistics and Data Science, Northwestern University, USA. |
| Pseudocode | No | No clearly labeled pseudocode or algorithm blocks are present in the paper. The methodology is described in paragraph form and mathematical formulations. |
| Open Source Code | No | No explicit statement about releasing code or a link to a code repository is provided in the paper. |
| Open Datasets | Yes | Our experiments cover both synthetic and real-world tasks: ... (ii) We use the GPT-J model on a sentiment classification task... We use the sentiment classification task with the Tweet Eval: Hate Speech Detection dataset (Basile et al., 2019)... and (ii) out-of-distribution (OOD) in-context examples from the CC-News corpus (Nagel, 2016) |
| Dataset Splits | Yes | To construct one sample in a batch, we use the following steps: (i) Sample the linear regression coefficient β_i ∈ R^20 from N(0, I). (ii) Generate queries x_{i,j} from the Gaussian mixture model ω_1·N(−2, I) + ω_2·N(2, I), where ω_1 = 1, ω_2 = 0 in pre-training; this yields {x_{i,j}}_{j=1}^{k} with k = 50. (iii) For each query x_{i,j}, use y_{i,j} = β_i^T x_{i,j} to calculate the true response... The pre-training process iterates for 500k steps. ... During testing, we generate samples similarly to the pre-training process. The batch size is 64 and the number of batches is 100, i.e., 6,400 samples in total. For each in-context length j ∈ [75], we calculate the R-squared between the estimate and the true value over all 6,400 samples. |
| Hardware Specification | Yes | We implement experiments on 1 NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using 'GPT-2 model' and 'GPT-J model' but does not specify versions for these models or any other software libraries or frameworks used (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Following the pre-training method in (Garg et al., 2022), we use a batch size of 64. ... The pre-training process iterates for 500k steps. ... we define the target function as f(x) = β^T x, β ∈ R^d, where d = 20. The distribution of x ∈ R^d is the Gaussian mixture model ω_1·N(−2, I) + ω_2·N(2, I), where ω_1 = 1, ω_2 = 0 in pre-training. ... For the decision tree, we consider the function f as a decision tree with 20-dimensional inputs and a depth of 4. ... For the 2-layer neural network, we consider ReLU neural networks. We set each function f as f(x) = Σ_{i=1}^{r} α_i·σ(w_i^T x), where α_i ∈ R, w_i ∈ R^d, and σ(·) = max(0, ·) is the ReLU activation function. We draw the network parameters α_i and w_i from N(0, 2/r) and N(0, I_d), respectively. We use r = 100 hidden nodes. |
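
The synthetic data-generation recipe quoted in the Dataset Splits and Experiment Setup rows (linear-regression targets y = β^T x and 2-layer ReLU network targets, with queries drawn from the mixture ω_1·N(−2, I) + ω_2·N(2, I)) can be sketched as follows. This is a minimal NumPy illustration of the described sampling steps, not the authors' released code (none is reported); the function names and RNG seed are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20          # input dimension
k = 50          # queries per sample (in-context length)
r = 100         # hidden nodes in the 2-layer ReLU network
omega1, omega2 = 1.0, 0.0   # mixture weights used during pre-training

def sample_queries(n):
    """Draw n queries from the mixture ω1·N(−2, I) + ω2·N(2, I) over R^d."""
    # Pick a mixture component mean (−2 or +2) per query, then add N(0, I) noise.
    means = rng.choice([-2.0, 2.0], size=n, p=[omega1, omega2])
    return rng.normal(loc=means[:, None], scale=1.0, size=(n, d))

def linear_regression_sample():
    """One batch element: β ~ N(0, I) in R^20, responses y_j = β^T x_j."""
    beta = rng.normal(size=d)
    X = sample_queries(k)
    y = X @ beta
    return X, y

def relu_network_sample():
    """Target f(x) = Σ_{i=1}^r α_i σ(w_i^T x), α_i ~ N(0, 2/r), w_i ~ N(0, I_d)."""
    alpha = rng.normal(scale=np.sqrt(2.0 / r), size=r)
    W = rng.normal(size=(r, d))
    X = sample_queries(k)
    y = np.maximum(X @ W.T, 0.0) @ alpha   # ReLU hidden layer, then linear readout
    return X, y
```

With ω_1 = 1 and ω_2 = 0 as in pre-training, every query mean is −2; the OOD test regime described in the paper would flip these weights toward the second component.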