Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context

Authors: Spencer Frei, Gal Vardi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We pre-train models using standard full-batch gradient descent on the logistic loss with R = 5 d, N = 40, learning rate η = 0.01, for 300 steps from a zero initialization, using PyTorch. The in-context training accuracy is measured... The in-context test accuracy is computed... We then average over 2500 tasks, and we plot this average with error bars corresponding to one standard error over these 2500 numbers."
Researcher Affiliation | Collaboration | Spencer Frei, UC Davis, EMAIL; Gal Vardi, Weizmann Institute of Science, EMAIL; "Now at Google DeepMind"
Pseudocode | No | The paper describes algorithms and proofs mathematically and textually, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | "Details on experiments and a link to our codebase are provided in Appendix E." Code is available on GitHub: https://github.com/spencerfrei/icl classification
Open Datasets | No | "We consider a restricted linear attention model, a setting considered in prior works (Wu et al., 2024; Kim et al., 2024). We assume that pre-training tasks are sampled from random instances of class-conditional Gaussian mixture model data... We assume that test-time in-context examples also come from class-conditional Gaussian mixtures as above but with two important differences."
Dataset Splits | No | The paper describes the generation of synthetic data for pre-training tasks (N samples) and test-time tasks (M samples), and defines the parameters for this generation, but does not provide specific train/test/validation splits from a fixed dataset.
Hardware Specification | Yes | "All computations can be run within an hour on a high-quality CPU, although we used an NVIDIA RTX 3500 Ada, which helped speed up the computations for the d = 1000 setting."
Software Dependencies | No | "We pre-train models using standard full-batch gradient descent on the logistic loss... using PyTorch."
Experiment Setup | Yes | "We pre-train models using standard full-batch gradient descent on the logistic loss with R = 5 d, N = 40, learning rate η = 0.01, for 300 steps from a zero initialization, using PyTorch. The in-context training accuracy is measured using the definition of training accuracy from Theorem 4.2: namely, we look at what proportion of the in-context examples (training data) is accurately predicted with the model ŷ(E_τ^{1:M}; W), where W is the trained transformer, for a single task τ, i.e. averaging 1(y_k = sign(ŷ(E_τ^{1:M}(x_k); W))) over k = 1, ..., M. The in-context test accuracy is computed by measuring whether sign(ŷ(E_τ^{1:M}(x_{M+1}); W)) = y_{M+1}. We then average over 2500 tasks, and we plot this average with error bars corresponding to one standard error over these 2500 numbers."
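The class-conditional Gaussian mixture sampling described in the Open Datasets row can be sketched as follows. This is a minimal illustration only, assuming labels y ∈ {±1} and inputs of the form x = y·μ + Gaussian noise with ‖μ‖ = R; the function name and default values are hypothetical, not the paper's exact construction.

```python
import numpy as np

def sample_gmm_task(d=10, N=40, R=5.0, rng=None):
    """Sample one in-context task from a class-conditional Gaussian
    mixture (sketch): a random mean mu with norm R, balanced +/-1
    labels, and x_k = y_k * mu + standard Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng()
    mu = rng.standard_normal(d)
    mu *= R / np.linalg.norm(mu)          # cluster mean with ||mu|| = R
    y = rng.choice([-1.0, 1.0], size=N)   # class labels
    X = y[:, None] * mu[None, :] + rng.standard_normal((N, d))
    return X, y, mu
```

Each call produces a fresh task; pre-training would draw many such tasks, while test-time tasks would be drawn with the modifications the paper describes.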
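The evaluation protocol quoted in the Experiment Setup row can be sketched as below: for each task, the in-context training accuracy checks sign(ŷ(E_τ^{1:M}(x_k); W)) against y_k for k = 1, ..., M, the test accuracy checks the held-out (x_{M+1}, y_{M+1}), and both are averaged over tasks with one standard error. The predictor form here, ŷ(x; W) = (1/M) Σ_j y_j x_j^T W x, is an assumed restricted-linear-attention parameterization in the spirit of the cited setting, and all function names are illustrative.

```python
import numpy as np

def icl_scores(W, X_ctx, y_ctx, X_query):
    """Sketch of a restricted linear attention predictor:
    yhat(x; W) = (1/M) * sum_j y_j * x_j^T W x, evaluated at each query row."""
    M = len(y_ctx)
    return (y_ctx @ (X_ctx @ W @ X_query.T)) / M

def eval_tasks(W, tasks):
    """Average in-context train/test accuracy over tasks, reporting one
    standard error, mirroring the protocol quoted above.
    Each task is (X_ctx, y_ctx, x_query, y_query)."""
    train_accs, test_accs = [], []
    for X_ctx, y_ctx, x_q, y_q in tasks:
        s_train = icl_scores(W, X_ctx, y_ctx, X_ctx)       # predict each context point
        train_accs.append(np.mean(np.sign(s_train) == y_ctx))
        s_test = icl_scores(W, X_ctx, y_ctx, x_q[None, :])[0]
        test_accs.append(float(np.sign(s_test) == y_q))
    train_accs, test_accs = np.array(train_accs), np.array(test_accs)
    sem = lambda a: a.std(ddof=1) / np.sqrt(len(a))        # one standard error
    return (train_accs.mean(), sem(train_accs)), (test_accs.mean(), sem(test_accs))
```

With W trained by full-batch gradient descent on the logistic loss as described, `eval_tasks` over 2500 sampled tasks would yield the plotted means and error bars.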