Task Diversity Shortens the In-Context Learning Plateau

Authors: Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, Ernest K. Ryu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data. Table 1 summarizes our experimental results on transformers.
Researcher Affiliation | Collaboration | Jaeyeon Kim (EMAIL, Harvard University); Sehyun Kwon (EMAIL, Samsung Research); Joo Young Choi (EMAIL, KRAFTON); Jongho Park (EMAIL, KRAFTON, UC Berkeley); Jaewoong Cho (EMAIL, KRAFTON); Jason D. Lee (EMAIL, UC Berkeley); Ernest K. Ryu (EMAIL, UCLA)
Pseudocode | Yes | Algorithm 1: Batch generation process
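The paper's Algorithm 1 (batch generation) is not reproduced in this report; the setup described elsewhere (a total batch size B = 64 split evenly across k tasks, so 64/k sequences per task) suggests a procedure along these lines. This is a hypothetical sketch, not the authors' code; `task_samplers` and `generate_batch` are names we introduce here.

```python
import random

def generate_batch(task_samplers, B=64):
    """Hypothetical sketch of a per-task batch split: draw B // k
    sequences from each of the k task samplers, then shuffle so the
    tasks are mixed within the batch."""
    k = len(task_samplers)
    per_task = B // k  # each task's share of the batch, 64/k in the paper
    batch = []
    for sampler in task_samplers:
        batch.extend(sampler() for _ in range(per_task))
    random.shuffle(batch)
    return batch
```

A sampler here is any zero-argument callable that returns one training sequence for its task, so the same routine works regardless of how many tasks are mixed.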
Open Source Code | Yes | These authors contributed equally as co-first authors. Code available at https://github.com/sehyunkwon/task-diversity-icl
Open Datasets | Yes | We use the word data from Nguyen et al. (2017).
Dataset Splits | Yes | In our experiments, n = 120 is a predetermined number shared across all tasks. Test loss. To evaluate the performance on ICL tasks, we measure the model's prediction error on the last (n-th) output of the prompt. Each sequence consists of six examples, where the preceding five serve as context examples and the last one as the query example.
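The test loss described above scores only the final output of each sequence. A minimal sketch of that evaluation, assuming a squared-error metric (as reported for continuous tasks) and a hypothetical function name:

```python
def last_output_loss(preds, targets):
    """Sketch: average squared error on only the last (n-th) output of
    each sequence, ignoring predictions on the context positions."""
    errors = [(p[-1] - t[-1]) ** 2 for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Scoring only the final position makes the metric a pure measure of in-context prediction given the preceding context examples.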
Hardware Specification | No | The paper discusses model architectures (transformer, Mamba, Hyena) and training configurations but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using model architectures like Transformer (GPT-2), Mamba, and Hyena, and the Adam optimizer, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001. We use B = 64, making each task's batch size 64/k. Regarding the choice of ℓ(·, ·), we use mean-squared error loss for continuous ICL tasks and cross-entropy loss for boolean ICL tasks. The input dimensions considered are d = 10 and 15.
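The reported hyperparameters can be collected into a single configuration fragment. This is our own summary of the quoted setup, not a file from the authors' repository; the key names are ours.

```python
# Hypothetical config sketch mirroring the reported experiment setup.
config = {
    "optimizer": "Adam",            # Kingma & Ba (2015)
    "learning_rate": 1e-4,          # 0.0001
    "batch_size": 64,               # B = 64, split as 64/k per task
    "loss_continuous": "mse",       # mean-squared error for continuous ICL tasks
    "loss_boolean": "cross_entropy",# cross-entropy for boolean ICL tasks
    "input_dims": [10, 15],         # d = 10 and d = 15
}
```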