Task Descriptors Help Transformers Learn Linear Models In-Context
Authors: Ruomin Huang, Rong Ge
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we verify our results by showing that the weights converge to the predicted global minimum and Transformers indeed perform better with task descriptors. Finally, we empirically verify our findings in Section 5. |
| Researcher Affiliation | Academia | Ruomin Huang Duke University EMAIL Rong Ge Duke University EMAIL |
| Pseudocode | No | The paper describes the methods narratively and mathematically but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing code or links to a code repository. |
| Open Datasets | No | We generate 4096 i.i.d. input sequences for each episode of training. For all experiments in this section, the data dimension d = 5 and the covariance matrix Λ = Id. |
| Dataset Splits | No | We generate 4096 i.i.d. input sequences for each episode of training. In practice, we can generate m sequences Sτ1, Sτ2, ..., Sτm, and the empirical loss is just the mean-squared error for all the sequences. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | We use Adam optimizer (Kingma & Ba, 2015) to train Transformers. We also use ℓ2 gradient clipping to stabilize training. No specific version numbers for software libraries or environments are provided. |
| Experiment Setup | Yes | We generate 4096 i.i.d. input sequences for each episode of training. For all experiments in this section, the data dimension d = 5 and the covariance matrix Λ = Id. For all experiments, we use Adam optimizer (Kingma & Ba, 2015) to train Transformers. We also use ℓ2 gradient clipping to stabilize training. |
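The experiment setup quoted above (4096 i.i.d. input sequences per training episode, data dimension d = 5, covariance matrix Λ = I_d) can be sketched as a data-generation routine for in-context linear regression. This is a minimal sketch, not the authors' code: the sequence length `seq_len` and the task prior w ~ N(0, I_d) are assumptions, since the paper excerpt does not specify them here.

```python
import numpy as np

def generate_episode(num_seqs=4096, seq_len=10, d=5, seed=0):
    """Generate i.i.d. in-context linear-regression sequences.

    Each sequence pairs inputs x_i ~ N(0, Lambda) with labels
    y_i = <w, x_i>, where the task vector w is drawn fresh per
    sequence. seq_len and the prior w ~ N(0, I_d) are hypothetical
    choices; the paper excerpt does not state them.
    """
    rng = np.random.default_rng(seed)
    Lambda = np.eye(d)  # covariance matrix Lambda = I_d, as in the paper
    # Inputs: shape (num_seqs, seq_len, d), drawn i.i.d. from N(0, Lambda).
    xs = rng.multivariate_normal(np.zeros(d), Lambda, size=(num_seqs, seq_len))
    # One task vector per sequence (assumed prior w ~ N(0, I_d)).
    ws = rng.standard_normal((num_seqs, d))
    # Noiseless labels y_i = <w, x_i> for every position in each sequence.
    ys = np.einsum('nld,nd->nl', xs, ws)
    return xs, ys

xs, ys = generate_episode()
```

Training would then minimize the mean-squared error over these sequences with Adam and ℓ2 gradient clipping, as the paper states.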