On Understanding Attention-Based In-Context Learning for Categorical Data
Authors: Aaron T Wang, William Convertino, Xiang Cheng, Ricardo Henao, Lawrence Carin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the framework empirically on synthetic data, image classification, and language generation. We empirically validate our framework through experiments on diverse datasets: (a) We tackle in-context image classification on ImageNet (Russakovsky et al., 2014)... (b) We apply our GD-based model to language generation, training on a combined corpus of Tiny Stories and Children Stories (Eldan & Li, 2023)... |
| Researcher Affiliation | Academia | 1Electrical & Computer Engineering Dept., Duke University, Durham, NC, USA. Correspondence to: Lawrence Carin <EMAIL>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided, but the model architecture and steps are described in prose and diagrams in Section 3 and Figures 1 and 2. |
| Open Source Code | Yes | Code needed to replicate our experiments is at https://github.com/aarontwang/icl_attention_categorical. |
| Open Datasets | Yes | We tackle in-context image classification on ImageNet (Russakovsky et al., 2014)... We apply our GD-based model to language generation, training on a combined corpus of Tiny Stories and Children Stories (Eldan & Li, 2023)... (https://huggingface.co/datasets/ajibawa-2023/Children-Stories-Collection) |
| Dataset Splits | Yes | For each contextual set C(l), 5 distinct classes are selected uniformly at random, and for each such class 10 specific images are selected at random, so N = 50 (image N + 1 is selected at random from the 5 class types considered in the context data). During training L = 2048 contextual sets are used, and test performance is averaged over M = 2048 sets. |
| Hardware Specification | Yes | All experiments were performed on a Tesla V100 PCIe 16 GB GPU. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or programming languages used for their implementation. It only mentions the use of "GPT-4o model" for evaluation. |
| Experiment Setup | Yes | Embedding vectors are learned for each token, with C = 50,257 unique tokens represented and an embedding dimension d = 512; 8 attention heads are used for both models. Additionally, positional embedding vectors are learned for each of the 256 positions in our model's context window, with an additional 257th position learned for the GD model (for position x N+1). |
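The in-context sampling procedure quoted under "Dataset Splits" (5 classes drawn uniformly at random, 10 images per class, so N = 50 context examples plus a query from one of the same 5 classes) can be sketched as below. This is a minimal illustration, not the authors' code; `labels_by_class` (a map from class id to image ids) and the function name are hypothetical.

```python
import random

def sample_context_set(labels_by_class, num_classes=5, per_class=10, seed=None):
    """Sketch of the paper's in-context classification setup:
    num_classes classes uniformly at random, per_class images each
    (N = num_classes * per_class), plus one query image drawn from
    one of the same classes (the paper's image N + 1)."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(labels_by_class), num_classes)
    # Context: (image, label) pairs, per_class images for each chosen class.
    context = [(img, c)
               for c in classes
               for img in rng.sample(labels_by_class[c], per_class)]
    # Query: a random image from one of the in-context classes.
    query_class = rng.choice(classes)
    query_img = rng.choice(labels_by_class[query_class])
    return context, (query_img, query_class)
```

With the paper's settings this yields N = 50 context pairs per contextual set; during training the paper draws L = 2048 such sets and averages test performance over M = 2048.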
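The embedding sizes quoted under "Experiment Setup" imply concrete parameter counts, worked out below as a quick sanity check. The variable names are ours; only the sizes (C = 50,257 tokens, d = 512, a 256-position context window plus a learned 257th position for the GD model) come from the paper.

```python
# Sizes only, taken from the quoted setup; no model code is implied.
VOCAB = 50_257   # C: unique tokens represented
D = 512          # d: embedding dimension
CTX = 256        # positions in the context window

token_emb_params = VOCAB * D        # learned token-embedding table
pos_emb_params = CTX * D            # baseline positional embeddings
pos_emb_params_gd = (CTX + 1) * D   # GD model adds a 257th position (x_{N+1})
```

So the token table alone holds 50,257 × 512 ≈ 25.7M parameters, dwarfing the positional table's 257 × 512 ≈ 0.13M for the GD model.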