Long-context LLMs Struggle with Long In-context Learning

Authors: Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification, using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space to make correct predictions. We evaluate 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label spaces and shorter demonstrations. However, they struggle with more challenging tasks like Discovery, with 174 labels, suggesting a gap in their ability to process long, context-rich sequences.
Researcher Affiliation | Academia | University of Waterloo; Carnegie Mellon University; Vector Institute, Toronto
Pseudocode | No | The paper describes methods and experiments but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/TIGER-AI-Lab/LongICLBench
Open Datasets | Yes | We collect six datasets containing context length from short to long... GoEmotions (Demszky et al., 2020)... BANKING77 (Casanueva et al., 2020)... TacRED (Zhang et al., 2017)... Few-NERD (Ding et al., 2021)... DialogRE (Yu et al., 2020)... Discovery (Sileo et al., 2019)...
Dataset Splits | Yes | In order to balance the sequence token length within each dataset against the goal of evaluating long in-context learning, we keep a subset of all the classes to format evaluation sets of 1, 2, 3, 4, and 5 rounds respectively, where each round represents a complete set of examples containing all unique chosen labels. We sample the number of instances from each of the classes evenly to reduce the bias resulting from the label distribution. ... For testing, we sample 500 examples from the test set of each dataset, simultaneously ensuring an even distribution over the label types.
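The round-based, label-balanced sampling described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the function name `build_rounds`, the `(text, label)` example format, and the seeding scheme are all assumptions for the sketch.

```python
import random
from collections import defaultdict

def build_rounds(examples, labels, num_rounds, seed=0):
    """Group labeled examples into demonstration rounds: each round holds
    exactly one example per unique label, sampled without replacement,
    so labels are evenly represented regardless of the original skew."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)
    for pool in by_label.values():
        rng.shuffle(pool)  # randomize which instances fill each round
    rounds = []
    for r in range(num_rounds):
        round_examples = [(by_label[lab][r], lab) for lab in labels]
        rng.shuffle(round_examples)  # mix label order within the round
        rounds.append(round_examples)
    return rounds
```

With this construction, a 5-round evaluation set for Discovery (174 labels) would contain 5 x 174 demonstrations, each label appearing exactly five times.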
Hardware Specification | Yes | All the open-source models are loaded from the weights on Hugging Face and inferred on eight NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions loading models from Hugging Face and using API-based models, but it does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries used in the experimental setup.
Experiment Setup | Yes | We construct a prompt following the template shown in A.2 for each of the datasets. To fairly evaluate the open-source and API-based models across a series of input lengths, we sample the same example set for all the models, with labels distributed evenly to ensure an unbiased distribution of the in-context demonstrations. For instance, an input of one round will include one set of examples traversing all the label types, and five rounds will contain instances of each label five times.
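The prompt assembly step above can be sketched like this. The paper's actual template lives in its Appendix A.2, so the `sentence:`/`label:` layout, the function name `format_prompt`, and the blank-label query convention below are hypothetical stand-ins rather than the authors' exact format.

```python
def format_prompt(task_instruction, rounds, query):
    """Flatten demonstration rounds into one classification prompt.
    Each demonstration is rendered as a sentence/label pair; the query
    is appended last with its label left blank for the model to fill."""
    parts = [task_instruction]
    for round_examples in rounds:
        for text, label in round_examples:
            parts.append(f"sentence: {text}\nlabel: {label}")
    parts.append(f"sentence: {query}\nlabel:")
    return "\n\n".join(parts)
```

Because every round traverses all labels, prompt length grows linearly with the round count, which is how the benchmark scales inputs from roughly 2K to 50K tokens.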