Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study
Authors: Xingxuan Zhang, Haoran Wang, Jiansheng Li, Yuan Xue, Shikai Guan, Renzhe Xu, Hao Zou, Han Yu, Peng Cui
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. |
| Researcher Affiliation | Academia | Tsinghua University |
| Pseudocode | No | The paper describes methodologies and procedures in narrative text and tables but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/UbeCc/Generalization-of-Transformers |
| Open Datasets | Yes | Our primary focus is on tool-calling tasks for pretrained models and translation tasks for models trained from scratch. Unless otherwise stated, the results reported in this section are the average of three independent runs. ... we pretrain randomly initialized models with the same architecture as Qwen2-1.5B (Yang et al., 2024; Bai et al., 2023) on the CC100 (Conneau, 2019) dataset, limiting the task language scope to English and German. Building upon the models pretrained on CC100, we train three models on the WMT14 corpus for the translation task: Baseline_e2d, Baseline_d2e, and ComFuncLearner, trained on English-to-German data, German-to-English data, and mixed-to-English data, respectively. |
| Dataset Splits | Yes | We also synthesise 0.5k single-turn data and 1k multi-turn data using gpt-4o for validation. ... We use 2k instruction data from Huggingface for each language. We use 0.3M translation training data from WMT14-en2de. ... We select 2M pieces of English and German corpus from CC100, respectively, as our pretraining data. We sample 500 sentences from the combination of the WMT14 and WMT19 en2de test sets, called WMT500. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using models like GPT-2, LLaMA-3, LLaMA-2, and Qwen-2, and packages such as 'bert-score' and 'sacrebleu', but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | During training, we set batch_size = 128, learning_rate = 5e-5. For convex and product combination, we train 50k steps in total, and we train 100k steps for composition combination as the loss is harder to converge. Following Garg et al. (2022), we set dropout = 0 because the model will see each input only once, as we resample at each step. Fine-tuning parameters: batch_size = 32, optimizer = AdamW, learning_rate = 5e-6, weight_decay = 0.001, max_grad_norm = 0.3, warmup_ratio = 0.03, lr_scheduler_type = constant. During the training phase, we use a learning rate of 5e-5, a warmup ratio of 4e-2, and a total batch size of 128. We train the model for two epochs and select the model checkpoint with the best performance on our validation set. |
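To make the reported hyperparameters easier to reuse, here is a minimal sketch collecting the fine-tuning settings from the Experiment Setup cell into a plain Python config. The dict layout and the `warmup_steps` helper are illustrative assumptions, not the authors' code; only the numeric values come from the paper.

```python
# Fine-tuning hyperparameters as reported in the paper's Experiment Setup.
# The structure below is a hypothetical sketch, not the authors' actual code.
finetune_config = {
    "batch_size": 32,
    "optimizer": "AdamW",
    "learning_rate": 5e-6,
    "weight_decay": 0.001,
    "max_grad_norm": 0.3,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "constant",
}


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio over a fixed run length."""
    return int(total_steps * warmup_ratio)


# E.g. applying this ratio to a 50k-step run (the length the paper reports
# for convex/product-combination training) would give:
print(warmup_steps(50_000, finetune_config["warmup_ratio"]))  # 1500
```

This mirrors the common convention (e.g. in Hugging Face `TrainingArguments`) where `warmup_ratio` is converted to an absolute step count before scheduling the learning rate.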