Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context
Authors: Spencer Frei, Gal Vardi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train models using standard full-batch gradient descent on the logistic loss with R = 5 d, N = 40, learning rate η = 0.01, for 300 steps from a zero initialization, using PyTorch. The in-context training accuracy is measured... The in-context test accuracy is computed... We then average over 2500 tasks, and we plot this average with error bars corresponding to one standard error over these 2500 numbers. |
| Researcher Affiliation | Collaboration | Spencer Frei, UC Davis (now at Google DeepMind); Gal Vardi, Weizmann Institute of Science |
| Pseudocode | No | The paper describes algorithms and proofs mathematically and textually, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Details on experiments and a link to our codebase are provided in Appendix E. Code is available on GitHub: https://github.com/spencerfrei/icl classification |
| Open Datasets | No | We consider a restricted linear attention model, a setting considered in prior works (Wu et al., 2024; Kim et al., 2024). We assume that pre-training tasks are sampled from random instances of class-conditional Gaussian mixture model data... We assume that test-time in-context examples also come from class-conditional Gaussian mixture models as above but with two important differences. |
| Dataset Splits | No | The paper describes the generation of synthetic data for pre-training tasks (N samples) and test-time tasks (M samples), and defines the parameters for this generation, but does not provide specific train/test/validation splits from a fixed dataset. |
| Hardware Specification | Yes | All computations can be run within an hour on a high-quality CPU, although we used an NVIDIA RTX 3500 Ada, which helped speed up the computations for the B = d = 1000 setting. |
| Software Dependencies | No | We pre-train models using standard full-batch gradient descent on the logistic loss... using PyTorch. |
| Experiment Setup | Yes | We pre-train models using standard full-batch gradient descent on the logistic loss with R = 5 d, N = 40, learning rate η = 0.01, for 300 steps from a zero initialization, using PyTorch. The in-context training accuracy is measured using the definition of training accuracy from Theorem 4.2: namely, we look at what proportion of the in-context examples (training data) is accurately predicted by the model ŷ(E_τ^{1:M}; W), where W is the trained transformer, for a single task τ, i.e. averaging 1(y_k = sign(ŷ(E_τ^{1:M}(x_k); W))) over k = 1, ..., M. The in-context test accuracy is computed by measuring whether sign(ŷ(E_τ^{1:M}(x_{M+1}); W)) = y_{M+1}. We then average over 2500 tasks, and we plot this average with error bars corresponding to one standard error over these 2500 numbers. |
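The training and evaluation recipe quoted above can be sketched end-to-end. The following is an illustrative NumPy reconstruction, not the authors' PyTorch code (that is in the linked repository): the restricted linear attention form ŷ = x_q^T W ((1/N) Σ_i y_i x_i) is the parameterization used in prior work such as Wu et al. (2024) and is assumed here, and the values of d, R, the task counts, and the exact mixture parameterization are placeholder choices rather than the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# The paper's Appendix E uses N = 40, eta = 0.01, 300 GD steps, zero init;
# d, R, and the task counts below are illustrative placeholders.
d, N, B = 10, 40, 200           # dimension, in-context examples, pre-training tasks
eta, steps, R = 0.01, 300, 5.0  # learning rate, GD steps, signal strength (assumed)

def sample_task(n):
    """Class-conditional Gaussian mixture: x = y * mu + noise.
    A sketch of the pre-training distribution, not the paper's exact one."""
    mu = rng.standard_normal(d)
    mu *= R / np.linalg.norm(mu)            # cluster mean of norm R
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, d))
    return x, y

def predict(W, x_ctx, y_ctx, x_q):
    """Restricted linear attention: yhat = x_q^T W (1/N) sum_i y_i x_i."""
    h = (y_ctx[:, None] * x_ctx).mean(axis=0)
    return x_q @ W @ h

# Fixed pre-training set of B tasks, each with N context pairs plus one query.
tasks = [sample_task(N + 1) for _ in range(B)]

W = np.zeros((d, d))                        # zero initialization
for _ in range(steps):                      # full-batch gradient descent
    grad = np.zeros_like(W)
    for x, y in tasks:
        x_ctx, y_ctx, x_q, y_q = x[:N], y[:N], x[N], y[N]
        h = (y_ctx[:, None] * x_ctx).mean(axis=0)
        z = y_q * (x_q @ W @ h)             # margin in the logistic loss
        grad += -y_q / (1.0 + np.exp(z)) * np.outer(x_q, h)
    W -= eta * grad / B

# In-context test accuracy on fresh tasks, with one standard error,
# mirroring the "average over tasks +/- one standard error" protocol.
accs = []
for _ in range(500):
    x, y = sample_task(N + 1)
    yhat = predict(W, x[:N], y[:N], x[N])
    accs.append(float(np.sign(yhat) == y[N]))
acc, se = np.mean(accs), np.std(accs) / np.sqrt(len(accs))
print(f"in-context test accuracy: {acc:.3f} +/- {se:.3f}")
```

The in-context *training* accuracy from Theorem 4.2 would be computed the same way, except the query in `predict` ranges over the N context points themselves rather than a held-out (M+1)-th example.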