Task Descriptors Help Transformers Learn Linear Models In-Context

Authors: Ruomin Huang, Rong Ge

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
  "Empirically, we verify our results by showing that the weights converge to the predicted global minimum and Transformers indeed perform better with task descriptors." "Finally, we empirically verify our findings in Section 5."

Researcher Affiliation: Academia
  Ruomin Huang (Duke University, EMAIL); Rong Ge (Duke University, EMAIL)

Pseudocode: No
  The paper describes its methods narratively and mathematically but does not include any clearly labeled pseudocode or algorithm blocks.

Open Source Code: No
  The paper makes no explicit statement about open-sourcing code and provides no link to a code repository.

Open Datasets: No
  "We generate 4096 i.i.d. input sequences for each episode of training. For all experiments in this section, the data dimension d = 5 and the covariance matrix Λ = I_d."

Dataset Splits: No
  "We generate 4096 i.i.d. input sequences for each episode of training." "In practice, we can generate m sequences S_{τ_1}, S_{τ_2}, ..., S_{τ_m}, and the empirical loss is just the mean-squared error for all the sequences."

Hardware Specification: No
  The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.

Software Dependencies: No
  "We use Adam optimizer (Kingma & Ba, 2015) to train Transformers. We also use ℓ2 gradient clipping to stabilize training." No specific version numbers for software libraries or environments are provided.

Experiment Setup: Yes
  "We generate 4096 i.i.d. input sequences for each episode of training. For all experiments in this section, the data dimension d = 5 and the covariance matrix Λ = I_d. For all experiments, we use Adam optimizer (Kingma & Ba, 2015) to train Transformers. We also use ℓ2 gradient clipping to stabilize training."
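The quoted experiment setup (data dimension d = 5, covariance Λ = I_d, 4096 i.i.d. input sequences per episode, empirical loss as the mean-squared error over all sequences) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the sequence length, the task prior over linear weights, and the least-squares stand-in predictor are all assumptions, since the paper trains a Transformer on these sequences rather than fitting least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                # data dimension (stated in the paper)
Lambda = np.eye(d)   # input covariance Λ = I_d (stated in the paper)
m = 4096             # sequences per training episode (stated in the paper)
n = 10               # in-context examples per sequence (assumed)

# Each task τ is a linear model w_τ; inputs x ~ N(0, Λ), labels y = <w_τ, x>.
# A standard Gaussian prior over w_τ is an assumption for this sketch.
W = rng.standard_normal((m, d))                               # (m, d)
X = rng.multivariate_normal(np.zeros(d), Lambda, size=(m, n))  # (m, n, d)
Y = np.einsum("mnd,md->mn", X, W)                              # (m, n)

def predict_last(Xs, Ys):
    """Predict the last label from the first n-1 in-context examples
    via ordinary least squares (illustrative stand-in for the model)."""
    w_hat, *_ = np.linalg.lstsq(Xs[:-1], Ys[:-1], rcond=None)
    return Xs[-1] @ w_hat

# Empirical loss: mean-squared error of the predictor over all m sequences,
# matching the quoted definition of the training objective.
preds = np.array([predict_last(X[i], Y[i]) for i in range(m)])
loss = np.mean((preds - Y[:, -1]) ** 2)
```

Because the labels here are noiseless and each sequence provides n - 1 = 9 > d examples, the least-squares fit recovers w_τ exactly and the empirical loss is near zero; the paper's point is that a trained Transformer approaches this behavior in-context.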