Trained Transformers Learn Linear Models In-Context

Authors: Ruiqi Zhang, Spencer Frei, Peter L. Bartlett

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts. (...) We then empirically investigate the behavior of large, nonlinear transformers when trained on linear regression prompts. We find that these more complex models are able to generalize better under covariate shift, especially when trained on prompts with varying covariate distributions." (Section 4.3)
Researcher Affiliation | Collaboration | "Ruiqi Zhang (EMAIL), Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA; Spencer Frei (EMAIL), Department of Statistics, University of California, Davis, 4118 Mathematical Sciences Building, 399 Crocker Ave., Davis, CA 95616, USA; Peter L. Bartlett (EMAIL), Department of Statistics and Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA, and Google DeepMind, 1600 Amphitheatre Parkway, Mountain View, CA 94040, USA"
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described using mathematical equations and textual explanations.
Open Source Code | No | "Our experimental setup is based on the codebase provided by Garg et al. (2022), with a modification that allows for the possibility that the covariate distribution changes across prompts."
Open Datasets | No | "In all experiments, covariates are sampled from a mean-zero Gaussian in d = 20 dimensions with either fixed or random covariance matrix. For the fixed covariance case, we fix the covariance matrix to be identity; for the random case, the covariance matrices are restricted to be diagonal and all diagonal entries are i.i.d. sampled from the standard exponential distribution. The linear weights in all tasks are i.i.d. sampled from standard Gaussian distribution and also independently from all covariates."
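The sampling procedure quoted above is concrete enough to sketch. Below is a minimal stdlib-Python illustration for the random-covariance case (diagonal covariance with i.i.d. Exp(1) entries, standard-Gaussian task weights); the function and variable names are illustrative, and since the quote does not mention label noise, labels are taken to be noiseless y = ⟨w, x⟩.

```python
import random

d = 20  # covariate dimension (from the paper)
N = 40  # one of the prompt lengths used in training (N = 40, 70, 100)

def sample_prompt(rng, d, n):
    """Sample one linear-regression prompt: a diagonal covariance with
    Exp(1) entries, mean-zero Gaussian covariates, a standard-Gaussian
    task vector w, and (assumed noiseless) labels y_i = <w, x_i>."""
    lam = [rng.expovariate(1.0) for _ in range(d)]  # diagonal covariance entries
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]     # task weight vector
    xs, ys = [], []
    for _ in range(n):
        # diagonal covariance => coordinates are independent with variance lam[j]
        x = [lam[j] ** 0.5 * rng.gauss(0.0, 1.0) for j in range(d)]
        xs.append(x)
        ys.append(sum(wj * xj for wj, xj in zip(w, x)))
    return xs, ys

rng = random.Random(0)
xs, ys = sample_prompt(rng, d, N)
```

For the fixed-covariance case described in the quote, one would simply replace the sampled `lam` with all-ones (identity covariance).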
Dataset Splits | No | "The plots in Figure 1 show the error averaged over 64² prompts, where we sample 64 covariance matrices for each curve and 64 prompts for each covariance matrix."
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | "We use the standard GPT2 architecture with embedding size 256, 12 layers and 8 heads (Radford et al., 2018) as implemented by Hugging Face (Wolf et al., 2020). For the GPT2 models, we use the embedding method proposed by Garg et al. (2022) (...) We trained the model for 500000 steps using Adam (Kingma and Ba, 2014) with a batch size of 64 and learning rate of 0.0001."
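The quote references the embedding method of Garg et al. (2022) without restating it. In that work, prompts are commonly described as an interleaved token sequence x₁, y₁, x₂, y₂, …, with each scalar label zero-padded to the covariate dimension before a learned read-in projection. A hedged stdlib sketch of that interleaving step (names are illustrative and details may differ from the actual codebase):

```python
def interleave_prompt(xs, ys, d):
    """Interleave covariates and labels into one token sequence:
    x_1, y_1, x_2, y_2, ... Each scalar y_i is zero-padded to
    dimension d so that all tokens share a common width."""
    tokens = []
    for x, y in zip(xs, ys):
        tokens.append(list(x))                    # covariate token (length d)
        tokens.append([float(y)] + [0.0] * (d - 1))  # label token, zero-padded
    return tokens

# Toy prompt: two (x, y) pairs in d = 3 dimensions.
toy = interleave_prompt([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0.5, -1.0], 3)
```

The resulting sequence has 2N tokens of width d, which a learned linear map would then project to the transformer's embedding size (256 in the quoted setup).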
Experiment Setup | Yes | "We trained the model for 500000 steps using Adam (Kingma and Ba, 2014) with a batch size of 64 and learning rate of 0.0001. We use the same curriculum strategy of Garg et al. (2022) for acceleration. (...) We consider linear models in d = 20 dimensions and we train on prompt lengths of N = 40, 70, 100 with either fixed or random covariance matrices."
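For quick reference, the training hyperparameters quoted in this report can be collected into one configuration sketch. All values below are taken verbatim from the quotes; the curriculum schedule is deferred to Garg et al. (2022) and is not encoded here.

```python
# Training configuration as quoted in the paper (values verbatim from the
# report above; key names are illustrative, not from the authors' codebase).
train_config = {
    "architecture": "GPT2 (Hugging Face implementation)",
    "embedding_size": 256,
    "num_layers": 12,
    "num_heads": 8,
    "optimizer": "Adam",
    "train_steps": 500_000,
    "batch_size": 64,
    "learning_rate": 1e-4,
    "dimension": 20,            # d = 20
    "prompt_lengths": [40, 70, 100],  # N values trained on
    "curriculum": "same as Garg et al. (2022); schedule not specified here",
}
```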