Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study
Authors: Xingxuan Zhang, Haoran Wang, Jiansheng Li, Yuan Xue, Shikai Guan, Renzhe Xu, Hao Zou, Han Yu, Peng Cui
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. |
| Researcher Affiliation | Academia | Tsinghua University |
| Pseudocode | No | The paper describes methodologies and procedures in narrative text and tables but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/UbeCc/Generalization-of-Transformers |
| Open Datasets | Yes | Our primary focus is on tool-calling tasks for pretrained models and translation tasks for models trained from scratch. Unless otherwise stated, the results reported in this section are the average of three independent runs. ... we pretrain randomly initialized models with the same architecture as Qwen2-1.5B (Yang et al., 2024; Bai et al., 2023) on the CC100 (Conneau, 2019) dataset, limiting the task language scope to English and German. Building upon the models pretrained on CC100, we train three models on the WMT14 corpus for the translation task: Baseline_e2d, Baseline_d2e, and ComFuncLearner, trained on English-to-German data, German-to-English data, and mixed-to-English data, respectively. |
| Dataset Splits | Yes | We also synthesise 0.5k single-turn data and 1k multi-turn data using gpt-4o for validation. ... We use 2k instruction data from Huggingface for each language. We use 0.3M translation training data from WMT14-en2de. ... We select 2M pieces of English and German corpus from CC100, respectively, as our pretraining data. We sample 500 sentences from the combination of the WMT14 and WMT19 en2de test sets, called WMT500. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using models like GPT-2, LLaMA-3, LLaMA-2, and Qwen-2, and packages such as 'bert-score' and 'sacrebleu', but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | During training, we set batch_size = 128, learning_rate = 5e-5. For convex and product combination, we train 50k steps in total, and we train 100k steps for composition combination as the loss is harder to converge. Following Garg et al. (2022), we set dropout = 0 because the model will see each input only once, as we resample at each step. Fine-tuning parameters: batch_size = 32, optimizer = AdamW, learning_rate = 5e-6, weight_decay = 0.001, max_grad_norm = 0.3, warmup_ratio = 0.03, lr_scheduler_type = constant. During the training phase, we use a learning rate of 5e-5, a warmup ratio of 4e-2, and a total batch size of 128. We train the model for two epochs and select the model checkpoint with the best performance on our validation set. |
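To make the reported hyperparameters easier to reuse, here is a minimal sketch collecting the fine-tuning settings from the Experiment Setup cell into a plain Python config. The dict layout and the `warmup_steps` helper are illustrative assumptions, not the authors' code; only the numeric values come from the paper.

```python
# Fine-tuning hyperparameters as reported in the paper's Experiment Setup.
# The structure below is a hypothetical sketch, not the authors' actual code.
finetune_config = {
    "batch_size": 32,
    "optimizer": "AdamW",
    "learning_rate": 5e-6,
    "weight_decay": 0.001,
    "max_grad_norm": 0.3,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "constant",
}


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio over a fixed run length."""
    return int(total_steps * warmup_ratio)


# E.g. applying this ratio to a 50k-step run (the length the paper reports
# for convex/product-combination training) would give:
print(warmup_steps(50_000, finetune_config["warmup_ratio"]))  # 1500
```

This mirrors the common convention (e.g. in Hugging Face `TrainingArguments`) where `warmup_ratio` is converted to an absolute step count before scheduling the learning rate.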