The Closeness of In-Context Learning and Weight Shifting for Softmax Regression
Authors: Shuai Li, Zhao Song, Yu Xia, Tong Yu, Tianyi Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our numerical experiments to validate our theoretical results that, when training self-attention-only Transformers for softmax regression tasks, the models learned by gradient descent and by the Transformers show great similarity. |
| Researcher Affiliation | Collaboration | Shuai Li, Shanghai Jiao Tong University (EMAIL); Zhao Song, Simons Institute for the Theory of Computing, UC Berkeley (EMAIL); Yu Xia, University of California, San Diego (EMAIL); Tong Yu, Adobe Research (EMAIL); Tianyi Zhou, University of Southern California (EMAIL) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The data and code are planned to be released upon acceptance and approval. |
| Open Datasets | No | According to Definition 1.3, we construct the synthetic softmax regression tasks consisting of randomly sampled length-n documents A ∈ ℝ^{n×d}, where each word has a d-dimensional embedding, and targets b ∈ ℝ^n. Each document is generated from a unique random seed. The paper does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset. (A hypothetical generation sketch appears after this table.) |
| Dataset Splits | No | To compare the trained single self-attention layer with a softmax unit against the softmax regression model trained with one-step gradient descent, we sample 10^3 tasks and record the losses of the two models. While a 'training set' of tasks is mentioned for selecting the learning rate, explicit train/validation/test splits of a dataset are not described in the usual sense required for reproducibility. |
| Hardware Specification | Yes | All experiments run on a single NVIDIA RTX2080Ti GPU with 10 independent repetitions. |
| Software Dependencies | No | The paper does not specify any software versions or library dependencies required for replication. |
| Experiment Setup | Yes | For the single self-attention layer with a softmax unit, we choose the learning rate ηSA = 0.005. For the softmax regression model, we determine the optimal learning rate ηGD by minimizing the ℓ2 regression loss over a training set of 10^3 tasks through line search. (A hypothetical line-search sketch appears after this table.) |
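
The Open Datasets row states only that tasks follow Definition 1.3: random length-n documents A ∈ ℝ^{n×d} and targets b ∈ ℝ^n, with one random seed per document. The paper does not spell out the sampling distributions here, so the following is a minimal sketch under assumed choices: Gaussian entries for A, a hidden coefficient vector `x_star`, and softmax-normalized targets. The function name `make_task` and the dimensions used are illustrative, not taken from the paper.

```python
import numpy as np

def make_task(n: int, d: int, seed: int):
    """Generate one synthetic softmax regression task (A, b).

    Assumptions (not specified in the report): entries of A are i.i.d.
    standard normal, and b is the softmax-normalized response
    exp(A x*) / <exp(A x*), 1_n> for a hidden coefficient vector x*.
    """
    rng = np.random.default_rng(seed)       # unique seed per document
    A = rng.standard_normal((n, d))         # length-n document, d-dim embeddings
    x_star = rng.standard_normal(d)         # hypothetical ground-truth weights
    u = np.exp(A @ x_star)
    b = u / u.sum()                         # targets b in R^n (sums to 1)
    return A, b
```

The Experiment Setup row says ηGD is chosen by line search, minimizing the ℓ2 regression loss over a training set of 10^3 tasks after one gradient step. The sketch below illustrates one plausible reading of that procedure; the zero initialization, the candidate grid of step sizes, and the loss 0.5·||softmax(Ax) − b||² are assumptions, and `make_task` is the hypothetical generator from the previous sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                         # numerically stable softmax
    u = np.exp(z)
    return u / u.sum()

def loss_and_grad(A, b, x):
    """Squared-error softmax regression loss and its gradient in x."""
    f = softmax(A @ x)
    r = f - b
    loss = 0.5 * float(r @ r)
    # Softmax Jacobian is diag(f) - f f^T, so grad = A^T (diag(f) - f f^T) r.
    grad = A.T @ ((np.diag(f) - np.outer(f, f)) @ r)
    return loss, grad

def line_search_eta(tasks, etas):
    """Pick eta_GD minimizing the mean loss after one gradient step."""
    best_eta, best_loss = None, np.inf
    for eta in etas:
        total = 0.0
        for A, b in tasks:
            x0 = np.zeros(A.shape[1])       # assumed zero initialization
            _, g = loss_and_grad(A, b, x0)
            x1 = x0 - eta * g               # one-step gradient descent
            l1, _ = loss_and_grad(A, b, x1)
            total += l1
        mean_loss = total / len(tasks)
        if mean_loss < best_loss:
            best_loss, best_eta = mean_loss, eta
    return best_eta

# 10^3 training tasks, one random seed per document (dimensions are illustrative).
tasks = [make_task(n=20, d=16, seed=s) for s in range(1000)]
eta_gd = line_search_eta(tasks, etas=np.logspace(-3, 1, 20))
```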
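With ηGD fixed this way, the reported comparison samples 10^3 fresh tasks and records, per task, the loss of the trained self-attention layer (trained with ηSA = 0.005) and the loss of the one-step gradient-descent softmax regression model; the paper compares these two loss profiles to argue the models behave similarly.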