Trained Transformers Learn Linear Models In-Context

Authors: Ruiqi Zhang, Spencer Frei, Peter L. Bartlett

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts. (...) We then empirically investigate the behavior of large, nonlinear transformers when trained on linear regression prompts. We find that these more complex models are able to generalize better under covariate shift, especially when trained on prompts with varying covariate distributions." (Section 4.3)
Researcher Affiliation | Collaboration | "Ruiqi Zhang (EMAIL), Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA; Spencer Frei (EMAIL), Department of Statistics, University of California, Davis, 4118 Mathematical Sciences Building, 399 Crocker Ave., Davis, CA 95616, USA; Peter L. Bartlett (EMAIL), Department of Statistics and Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA, and Google DeepMind, 1600 Amphitheatre Parkway, Mountain View, CA 94040, USA"
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described using mathematical equations and textual explanations.
Open Source Code | No | "Our experimental setup is based on the codebase provided by Garg et al. (2022), with a modification that allows for the possibility that the covariate distribution changes across prompts."
Open Datasets | No | "In all experiments, covariates are sampled from a mean-zero Gaussian in d = 20 dimensions with either fixed or random covariance matrix. For the fixed covariance case, we fix the covariance matrix to be identity; for the random case, the covariance matrices are restricted to be diagonal and all diagonal entries are i.i.d. sampled from the standard exponential distribution. The linear weights in all tasks are i.i.d. sampled from standard Gaussian distribution and also independently from all covariates."
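The sampling procedure quoted above is concrete enough to sketch. Below is a minimal stdlib-Python illustration for the random-covariance case (diagonal covariance with i.i.d. Exp(1) entries, standard-Gaussian task weights); the function and variable names are illustrative, and since the quote does not mention label noise, labels are taken to be noiseless y = ⟨w, x⟩.

```python
import random

d = 20  # covariate dimension (from the paper)
N = 40  # one of the prompt lengths used in training (N = 40, 70, 100)

def sample_prompt(rng, d, n):
    """Sample one linear-regression prompt: a diagonal covariance with
    Exp(1) entries, mean-zero Gaussian covariates, a standard-Gaussian
    task vector w, and (assumed noiseless) labels y_i = <w, x_i>."""
    lam = [rng.expovariate(1.0) for _ in range(d)]  # diagonal covariance entries
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]     # task weight vector
    xs, ys = [], []
    for _ in range(n):
        # diagonal covariance => coordinates are independent with variance lam[j]
        x = [lam[j] ** 0.5 * rng.gauss(0.0, 1.0) for j in range(d)]
        xs.append(x)
        ys.append(sum(wj * xj for wj, xj in zip(w, x)))
    return xs, ys

rng = random.Random(0)
xs, ys = sample_prompt(rng, d, N)
```

For the fixed-covariance case described in the quote, one would simply replace the sampled `lam` with all-ones (identity covariance).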
Dataset Splits | No | "The plots in Figure 1 show the error averaged over 64² prompts, where we sample 64 covariance matrices for each curve and 64 prompts for each covariance matrix."
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | "We use the standard GPT2 architecture with embedding size 256, 12 layers and 8 heads (Radford et al., 2018) as implemented by Hugging Face (Wolf et al., 2020). For the GPT2 models, we use the embedding method proposed by Garg et al. (2022) (...) We trained the model for 500000 steps using Adam (Kingma and Ba, 2014) with a batch size of 64 and learning rate of 0.0001."
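The quote references the embedding method of Garg et al. (2022) without restating it. In that work, prompts are commonly described as an interleaved token sequence x₁, y₁, x₂, y₂, …, with each scalar label zero-padded to the covariate dimension before a learned read-in projection. A hedged stdlib sketch of that interleaving step (names are illustrative and details may differ from the actual codebase):

```python
def interleave_prompt(xs, ys, d):
    """Interleave covariates and labels into one token sequence:
    x_1, y_1, x_2, y_2, ... Each scalar y_i is zero-padded to
    dimension d so that all tokens share a common width."""
    tokens = []
    for x, y in zip(xs, ys):
        tokens.append(list(x))                    # covariate token (length d)
        tokens.append([float(y)] + [0.0] * (d - 1))  # label token, zero-padded
    return tokens

# Toy prompt: two (x, y) pairs in d = 3 dimensions.
toy = interleave_prompt([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0.5, -1.0], 3)
```

The resulting sequence has 2N tokens of width d, which a learned linear map would then project to the transformer's embedding size (256 in the quoted setup).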
Experiment Setup | Yes | "We trained the model for 500000 steps using Adam (Kingma and Ba, 2014) with a batch size of 64 and learning rate of 0.0001. We use the same curriculum strategy of Garg et al. (2022) for acceleration. (...) We consider linear models in d = 20 dimensions and we train on prompt lengths of N = 40, 70, 100 with either fixed or random covariance matrices."
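For quick reference, the training hyperparameters quoted in this report can be collected into one configuration sketch. All values below are taken verbatim from the quotes; the curriculum schedule is deferred to Garg et al. (2022) and is not encoded here.

```python
# Training configuration as quoted in the paper (values verbatim from the
# report above; key names are illustrative, not from the authors' codebase).
train_config = {
    "architecture": "GPT2 (Hugging Face implementation)",
    "embedding_size": 256,
    "num_layers": 12,
    "num_heads": 8,
    "optimizer": "Adam",
    "train_steps": 500_000,
    "batch_size": 64,
    "learning_rate": 1e-4,
    "dimension": 20,            # d = 20
    "prompt_lengths": [40, 70, 100],  # N values trained on
    "curriculum": "same as Garg et al. (2022); schedule not specified here",
}
```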