On Understanding Attention-Based In-Context Learning for Categorical Data

Authors: Aaron T Wang, William Convertino, Xiang Cheng, Ricardo Henao, Lawrence Carin

ICML 2025

Reproducibility checklist — for each variable: the result, followed by the supporting LLM response.
Research Type: Experimental — "We demonstrate the framework empirically on synthetic data, image classification and language generation." "We empirically validate our framework through experiments on diverse datasets: (a) We tackle in-context image classification on ImageNet (Russakovsky et al., 2014)... (b) We apply our GD-based model to language generation, training on a combined corpus of TinyStories and Children Stories (Eldan & Li, 2023)..."
Researcher Affiliation: Academia — "Electrical & Computer Engineering Dept., Duke University, Durham, NC, USA. Correspondence to: Lawrence Carin <EMAIL>."
Pseudocode: No — No explicit pseudocode or algorithm blocks are provided, but the model architecture and steps are described in prose and diagrams in Section 3 and Figures 1 and 2.
Open Source Code: Yes — "Code needed to replicate our experiments is at https://github.com/aarontwang/icl_attention_categorical."
Open Datasets: Yes — "We tackle in-context image classification on ImageNet (Russakovsky et al., 2014)... We apply our GD-based model to language generation, training on a combined corpus of TinyStories and Children Stories (Eldan & Li, 2023)..." (https://huggingface.co/datasets/ajibawa-2023/Children-Stories-Collection)
Dataset Splits: Yes — "For each contextual set C(l), 5 distinct classes are selected uniformly at random, and for each such class 10 specific images are selected at random, and therefore N = 50 (image N + 1 is selected at random from the 5 class types considered in the context data). When training, L = 2048, and test performance is averaged for M = 2048."
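The context-construction procedure quoted above (5 random classes, 10 images each, plus a query image from one of the same classes) can be sketched as follows; `sample_contextual_set` and `images_by_class` are hypothetical names, not from the authors' code:

```python
import random

def sample_contextual_set(images_by_class, num_classes=5, images_per_class=10, rng=random):
    """Sketch of one contextual set C(l) as described in the quote above.

    images_by_class: dict mapping class label -> list of images.
    Returns (context, (query, query_class)) with len(context) == N = 50.
    """
    # 5 distinct classes selected uniformly at random
    classes = rng.sample(sorted(images_by_class), num_classes)
    # 10 specific images selected at random per class -> N = 50 context pairs
    context = [(img, c)
               for c in classes
               for img in rng.sample(images_by_class[c], images_per_class)]
    # image N + 1 is drawn from one of the 5 class types in the context
    query_class = rng.choice(classes)
    query = rng.choice(images_by_class[query_class])
    return context, (query, query_class)
```

During training, L = 2048 such sets would be drawn per epoch, and test accuracy averaged over M = 2048 sets.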
Hardware Specification: Yes — "All experiments were performed on a Tesla V100 PCIe 16 GB GPU."
Software Dependencies: No — The paper does not provide version numbers for its software dependencies, such as the libraries or programming languages used in the implementation. It only mentions the use of the "GPT-4o model" for evaluation.
Experiment Setup: Yes — "embedding vectors are learned for each token, with C = 50,257 unique tokens represented and an embedding dimension d = 512; 8 attention heads are used for both models. Additionally, positional embedding vectors are learned for each of the 256 positions in our model's context window, with an additional 257th position learned for the GD model (for position x_{N+1})."
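The quoted setup (C = 50,257 tokens, d = 512, 8 heads, 256 learned positions plus one extra slot for x_{N+1}) can be sketched as below. This is a minimal NumPy illustration of the embedding shapes only, not the authors' implementation; the initialization scale and the `embed` helper are assumptions:

```python
import numpy as np

C, d = 50_257, 512           # vocabulary size and embedding dimension (from the paper)
n_heads, ctx_len = 8, 256    # attention heads and context-window length (from the paper)

rng = np.random.default_rng(0)
token_emb = rng.normal(scale=0.02, size=(C, d))           # learned token embeddings
pos_emb = rng.normal(scale=0.02, size=(ctx_len + 1, d))   # 256 positions + 257th slot for x_{N+1}

def embed(token_ids):
    # Sum token and positional embeddings for a sequence of ids (each < C).
    n = len(token_ids)
    return token_emb[np.asarray(token_ids)] + pos_emb[:n]
```

With d = 512 split across 8 heads, each head operates on 64-dimensional projections.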