NoLiMa: Long-Context Evaluation Beyond Literal Matching

Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schuetze

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. [...] We conduct extensive analyses using NOLIMA, yielding the following insights: [...] Table 3 presents the performance results of all NOLIMA tests on the selected models."
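The degradation criterion quoted above ("drop below 50% of their strong short-length baselines") can be sketched as a simple threshold check. This is an illustrative function, not code from the NoLiMa repository; the function and variable names are assumptions.

```python
# Sketch of the degradation criterion: a model "drops below 50%" at a given
# context length when its score there is less than half of its short-context
# base score. Names are illustrative, not from the NoLiMa codebase.

def drops_below_half(base_score: float, long_context_score: float) -> bool:
    """Return True if the long-context score is under 50% of the base score."""
    if base_score <= 0:
        raise ValueError("base score must be positive")
    return long_context_score < 0.5 * base_score

# Example with made-up numbers: a model scoring 90.0 in short contexts
# but 40.0 at 32K has retained less than half of its baseline.
print(drops_below_half(90.0, 40.0))  # True
print(drops_below_half(90.0, 50.0))  # False
```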
Researcher Affiliation | Collaboration | (1) Center for Information and Language Processing, LMU Munich, Germany; (2) Munich Center for Machine Learning (MCML); (3) Adobe Research. Correspondence to: Ali Modarressi <EMAIL>.
Pseudocode | No | The paper describes the "Haystack filtering pipeline for undesired or misleading content" with a diagram and accompanying textual explanation in Section 3.1. It also details the "Needle Set Design & Considerations" in Section A. However, it does not provide any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa."
Open Datasets | Yes | "We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa."
Dataset Splits | No | The paper describes the generation of haystacks and needle placements, resulting in "7,540 tests per context length experiment". It also mentions that "Evaluations at context lengths of 250, 500, and 1K are used to compute the base score." This describes the evaluation setup for pre-trained models rather than providing traditional train/validation/test splits for model training.
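The evaluation setup quoted above can be sketched as plain data plus a small aggregation helper. Note the aggregation used here (the mean over the three short-context scores) is an assumption for illustration; the quoted excerpt does not specify how the base score is computed.

```python
# Illustrative sketch of the evaluation setup: each context length has 7,540
# tests, and short-context runs (250, 500, and 1K tokens) feed into a "base
# score". The mean aggregation below is an assumption, not from the paper.

from statistics import mean

SHORT_CONTEXT_LENGTHS = [250, 500, 1000]  # lengths used for the base score
TESTS_PER_CONTEXT_LENGTH = 7_540

def base_score(scores_by_length: dict[int, float]) -> float:
    """Aggregate short-context scores into a single base score (mean here)."""
    return mean(scores_by_length[n] for n in SHORT_CONTEXT_LENGTHS)

# Example with made-up scores:
scores = {250: 98.0, 500: 96.0, 1000: 94.0}
print(base_score(scores))  # 96.0
```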
Hardware Specification | No | The paper mentions that open-weight models were deployed using the vLLM library, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions that "open weights were deployed using the vLLM library (Kwon et al., 2023), with weights obtained from Hugging Face (Wolf et al., 2020)". However, it does not provide specific version numbers for the vLLM library or any other key software dependencies.
Experiment Setup | Yes | "During inference, we use a task template (see Appendix C) that instructs the model to answer the question based on the provided text. [...] For all standard instruction-tuned models, we use greedy decoding during generation. For reasoning-based models, we utilize the default sampling decoding mechanism for GPT-o1 and GPT-o3 Mini, while R1-based models employ top-P sampling with p = 0.95 and a temperature of 0.6. In addition, we cap the maximum number of generated tokens in reasoning-based models at 1536 tokens, including both reasoning and output tokens."
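The decoding settings reported in this row can be collected as plain data for reference. The dictionary layout and key names below are illustrative assumptions, not from the paper's evaluation code; only the numeric values (p = 0.95, temperature 0.6, 1536-token cap) come from the quoted excerpt.

```python
# Decoding settings reported in the experiment-setup excerpt, collected as
# plain data. Layout and key names are illustrative, not from the paper.

DECODING_CONFIGS = {
    # Standard instruction-tuned models: deterministic greedy decoding.
    "instruction_tuned": {"strategy": "greedy", "temperature": 0.0},
    # GPT-o1 / GPT-o3 Mini: provider-default sampling; reasoning + output
    # tokens capped at 1536.
    "gpt_reasoning": {"strategy": "default_sampling", "max_tokens": 1536},
    # R1-based models: top-P sampling with p = 0.95, temperature 0.6,
    # same 1536-token cap.
    "r1_based": {
        "strategy": "top_p",
        "top_p": 0.95,
        "temperature": 0.6,
        "max_tokens": 1536,
    },
}

print(DECODING_CONFIGS["r1_based"]["top_p"])  # 0.95
```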