Task Diversity Shortens the In-Context Learning Plateau

Authors: Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, Ernest K. Ryu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data. Table 1 summarizes our experimental results on transformers.
Researcher Affiliation | Collaboration | Jaeyeon Kim (EMAIL, Harvard University); Sehyun Kwon (EMAIL, Samsung Research); Joo Young Choi (EMAIL, KRAFTON); Jongho Park (EMAIL, KRAFTON, UC Berkeley); Jaewoong Cho (EMAIL, KRAFTON); Jason D. Lee (EMAIL, UC Berkeley); Ernest K. Ryu (EMAIL, UCLA)
Pseudocode | Yes | Algorithm 1: Batch generation process
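The paper's Algorithm 1 (batch generation) is not reproduced in this report; the setup described elsewhere (a total batch size B = 64 split evenly across k tasks, so 64/k sequences per task) suggests a procedure along these lines. This is a hypothetical sketch, not the authors' code; `task_samplers` and `generate_batch` are names we introduce here.

```python
import random

def generate_batch(task_samplers, B=64):
    """Hypothetical sketch of a per-task batch split: draw B // k
    sequences from each of the k task samplers, then shuffle so the
    tasks are mixed within the batch."""
    k = len(task_samplers)
    per_task = B // k  # each task's share of the batch, 64/k in the paper
    batch = []
    for sampler in task_samplers:
        batch.extend(sampler() for _ in range(per_task))
    random.shuffle(batch)
    return batch
```

A sampler here is any zero-argument callable that returns one training sequence for its task, so the same routine works regardless of how many tasks are mixed.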
Open Source Code | Yes | These authors contributed equally as co-first authors. Code available at https://github.com/sehyunkwon/task-diversity-icl
Open Datasets | Yes | We use the word data from Nguyen et al. (2017).
Dataset Splits | Yes | In our experiments, n = 120 is a predetermined number shared across all tasks. Test loss. To evaluate the performance on ICL tasks, we measure the model's prediction error on the last (n-th) output of the prompt. Each sequence consists of six examples, where the preceding five serve as context examples and the last one as the query example.
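The test loss described above scores only the final output of each sequence. A minimal sketch of that evaluation, assuming a squared-error metric (as reported for continuous tasks) and a hypothetical function name:

```python
def last_output_loss(preds, targets):
    """Sketch: average squared error on only the last (n-th) output of
    each sequence, ignoring predictions on the context positions."""
    errors = [(p[-1] - t[-1]) ** 2 for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Scoring only the final position makes the metric a pure measure of in-context prediction given the preceding context examples.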
Hardware Specification | No | The paper discusses model architectures (transformer, Mamba, Hyena) and training configurations but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using model architectures like Transformer (GPT-2), Mamba, and Hyena, and the Adam optimizer, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001. We use B = 64, making each task's batch size 64/k. Regarding the choice of ℓ(·, ·), we use mean-squared error loss for continuous ICL tasks and cross-entropy loss for boolean ICL tasks. The input dimensions considered are d = 10 and 15.
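The reported hyperparameters can be collected into a single configuration fragment. This is our own summary of the quoted setup, not a file from the authors' repository; the key names are ours.

```python
# Hypothetical config sketch mirroring the reported experiment setup.
config = {
    "optimizer": "Adam",            # Kingma & Ba (2015)
    "learning_rate": 1e-4,          # 0.0001
    "batch_size": 64,               # B = 64, split as 64/k per task
    "loss_continuous": "mse",       # mean-squared error for continuous ICL tasks
    "loss_boolean": "cross_entropy",# cross-entropy for boolean ICL tasks
    "input_dims": [10, 15],         # d = 10 and d = 15
}
```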