Task Diversity Shortens the In-Context Learning Plateau
Authors: Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, Ernest K. Ryu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data. Table 1 summarizes our experimental results on transformers. |
| Researcher Affiliation | Collaboration | Jaeyeon Kim (Harvard University); Sehyun Kwon (Samsung Research); Joo Young Choi (KRAFTON); Jongho Park (KRAFTON, UC Berkeley); Jaewoong Cho (KRAFTON); Jason D. Lee (UC Berkeley); Ernest K. Ryu (UCLA) |
| Pseudocode | Yes | Algorithm 1 Batch generation process |
| Open Source Code | Yes | Code available at https://github.com/sehyunkwon/task-diversity-icl |
| Open Datasets | Yes | We use the word data from Nguyen et al. (2017). |
| Dataset Splits | Yes | In our experiments, n = 120 is a predetermined number shared across all tasks. Test loss: to evaluate performance on ICL tasks, we measure the model's prediction error on the last (n-th) output of the prompt. Each sequence consists of six examples, where the preceding five serve as context examples and the last one as the query example. |
| Hardware Specification | No | The paper discusses model architectures (transformer, Mamba, Hyena) and training configurations but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using model architectures like Transformer (GPT2), Mamba, and Hyena, and the Adam optimizer, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001. We use B = 64, making each task's batch size 64/k. Regarding the choice of ℓ(·, ·), we use mean-squared error loss for continuous ICL tasks and cross-entropy loss for boolean ICL tasks. The input dimensions considered are d = 10 and 15. |
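
The batch generation process referenced as Algorithm 1, combined with the setup above (B = 64 split evenly across k tasks, prompts of n = 120 example pairs in dimension d = 10), can be sketched as below. This is a minimal illustration, not the authors' implementation; the function name `generate_batch` and the toy linear-regression tasks are assumptions for the example.

```python
import numpy as np

def generate_batch(tasks, batch_size=64, n=120, d=10, rng=None):
    """Sketch of the Algorithm 1 idea: the total batch of size B is
    split evenly across the k training tasks, so each task contributes
    B/k sequences of n (input, output) pairs.  Hypothetical helper."""
    rng = rng or np.random.default_rng(0)
    k = len(tasks)
    per_task = batch_size // k  # each task's share of the batch: B/k
    xs, ys = [], []
    for task in tasks:
        x = rng.standard_normal((per_task, n, d))  # prompt inputs
        y = task(x)                                # task-specific outputs
        xs.append(x)
        ys.append(y)
    return np.concatenate(xs), np.concatenate(ys)

# Example: k = 2 toy linear-regression ICL tasks in d = 10.
w1, w2 = np.ones(10), -np.ones(10)
batch_x, batch_y = generate_batch([lambda x: x @ w1, lambda x: x @ w2])
```

With k = 2 tasks the batch holds 32 sequences per task, giving `batch_x` of shape (64, 120, 10) and `batch_y` of shape (64, 120).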
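
The evaluation protocol quoted under Dataset Splits (test loss = prediction error on the last, n-th output of the prompt, with mean-squared error for continuous tasks) could be computed as in the sketch below; the function name `last_token_mse` is a hypothetical helper, not from the paper.

```python
import numpy as np

def last_token_mse(pred, target):
    """Test loss as described in the paper: the model's prediction
    error on the last (n-th) output of each prompt, using MSE for
    continuous ICL tasks.  pred/target have shape (batch, n)."""
    return float(np.mean((pred[:, -1] - target[:, -1]) ** 2))

# Hypothetical predictions/targets for a batch of 64 length-120 prompts.
rng = np.random.default_rng(0)
target = rng.standard_normal((64, 120))
perfect = last_token_mse(target, target)  # a perfect model scores 0
```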