Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context

Authors: Spencer Frei, Gal Vardi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We pre-train models using standard full-batch gradient descent on the logistic loss with R = 5 d, N = 40, learning rate η = 0.01, for 300 steps from a zero initialization, using PyTorch. The in-context training accuracy is measured... The in-context test accuracy is computed... We then average over 2500 tasks, and we plot this average with error bars corresponding to one standard error over these 2500 numbers."
Researcher Affiliation | Collaboration | Spencer Frei, UC Davis, EMAIL; Gal Vardi, Weizmann Institute of Science, EMAIL; "Now at Google DeepMind"
Pseudocode | No | The paper describes algorithms and proofs mathematically and textually, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | "Details on experiments and a link to our codebase are provided in Appendix E." Code is available on GitHub: https://github.com/spencerfrei/icl classification
Open Datasets | No | "We consider a restricted linear attention model, a setting considered in prior works (Wu et al., 2024; Kim et al., 2024). We assume that pre-training tasks are sampled from random instances of class-conditional Gaussian mixture model data... We assume that test-time in-context examples also come from class-conditional Gaussian mixtures as above but with two important differences."
Dataset Splits | No | The paper describes the generation of synthetic data for pre-training tasks (N samples) and test-time tasks (M samples), and defines the parameters for this generation, but does not provide specific train/test/validation splits from a fixed dataset.
Hardware Specification | Yes | "All computations can be run within an hour on a high-quality CPU, although we used an NVIDIA RTX 3500 Ada, which helped speed up the computations for the d = 1000 setting."
Software Dependencies | No | "We pre-train models using standard full-batch gradient descent on the logistic loss... using PyTorch."
Experiment Setup | Yes | "We pre-train models using standard full-batch gradient descent on the logistic loss with R = 5 d, N = 40, learning rate η = 0.01, for 300 steps from a zero initialization, using PyTorch. The in-context training accuracy is measured using the definition of training accuracy from Theorem 4.2: namely, we look at what proportion of the in-context examples (training data) is accurately predicted with the model ŷ(E_τ^{1:M}; W), where W is the trained transformer, for a single task τ, i.e. averaging 1(y_k = sign(ŷ(E_τ^{1:M}(x_k); W))) over k = 1, ..., M. The in-context test accuracy is computed by measuring whether sign(ŷ(E_τ^{1:M}(x_{M+1}); W)) = y_{M+1}. We then average over 2500 tasks, and we plot this average with error bars corresponding to one standard error over these 2500 numbers."
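The class-conditional Gaussian mixture sampling described in the Open Datasets row can be sketched as follows. This is a minimal illustration only, assuming labels y ∈ {±1} and inputs of the form x = y·μ + Gaussian noise with ‖μ‖ = R; the function name and default values are hypothetical, not the paper's exact construction.

```python
import numpy as np

def sample_gmm_task(d=10, N=40, R=5.0, rng=None):
    """Sample one in-context task from a class-conditional Gaussian
    mixture (sketch): a random mean mu with norm R, balanced +/-1
    labels, and x_k = y_k * mu + standard Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng()
    mu = rng.standard_normal(d)
    mu *= R / np.linalg.norm(mu)          # cluster mean with ||mu|| = R
    y = rng.choice([-1.0, 1.0], size=N)   # class labels
    X = y[:, None] * mu[None, :] + rng.standard_normal((N, d))
    return X, y, mu
```

Each call produces a fresh task; pre-training would draw many such tasks, while test-time tasks would be drawn with the modifications the paper describes.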
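The evaluation protocol quoted in the Experiment Setup row can be sketched as below: for each task, the in-context training accuracy checks sign(ŷ(E_τ^{1:M}(x_k); W)) against y_k for k = 1, ..., M, the test accuracy checks the held-out (x_{M+1}, y_{M+1}), and both are averaged over tasks with one standard error. The predictor form here, ŷ(x; W) = (1/M) Σ_j y_j x_j^T W x, is an assumed restricted-linear-attention parameterization in the spirit of the cited setting, and all function names are illustrative.

```python
import numpy as np

def icl_scores(W, X_ctx, y_ctx, X_query):
    """Sketch of a restricted linear attention predictor:
    yhat(x; W) = (1/M) * sum_j y_j * x_j^T W x, evaluated at each query row."""
    M = len(y_ctx)
    return (y_ctx @ (X_ctx @ W @ X_query.T)) / M

def eval_tasks(W, tasks):
    """Average in-context train/test accuracy over tasks, reporting one
    standard error, mirroring the protocol quoted above.
    Each task is (X_ctx, y_ctx, x_query, y_query)."""
    train_accs, test_accs = [], []
    for X_ctx, y_ctx, x_q, y_q in tasks:
        s_train = icl_scores(W, X_ctx, y_ctx, X_ctx)       # predict each context point
        train_accs.append(np.mean(np.sign(s_train) == y_ctx))
        s_test = icl_scores(W, X_ctx, y_ctx, x_q[None, :])[0]
        test_accs.append(float(np.sign(s_test) == y_q))
    train_accs, test_accs = np.array(train_accs), np.array(test_accs)
    sem = lambda a: a.std(ddof=1) / np.sqrt(len(a))        # one standard error
    return (train_accs.mean(), sem(train_accs)), (test_accs.mean(), sem(test_accs))
```

With W trained by full-batch gradient descent on the logistic loss as described, `eval_tasks` over 2500 sampled tasks would yield the plotted means and error bars.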