NoLiMa: Long-Context Evaluation Beyond Literal Matching

Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schuetze

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. [...] We conduct extensive analyses using NOLIMA, yielding the following insights: [...] Table 3 presents the performance results of all NOLIMA tests on the selected models."
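The degradation criterion quoted above ("drop below 50% of their strong short-length baselines") can be sketched as a simple threshold check. This is an illustrative function, not code from the NoLiMa repository; the function and variable names are assumptions.

```python
# Sketch of the degradation criterion: a model "drops below 50%" at a given
# context length when its score there is less than half of its short-context
# base score. Names are illustrative, not from the NoLiMa codebase.

def drops_below_half(base_score: float, long_context_score: float) -> bool:
    """Return True if the long-context score is under 50% of the base score."""
    if base_score <= 0:
        raise ValueError("base score must be positive")
    return long_context_score < 0.5 * base_score

# Example with made-up numbers: a model scoring 90.0 in short contexts
# but 40.0 at 32K has retained less than half of its baseline.
print(drops_below_half(90.0, 40.0))  # True
print(drops_below_half(90.0, 50.0))  # False
```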
Researcher Affiliation | Collaboration | (1) Center for Information and Language Processing, LMU Munich, Germany; (2) Munich Center for Machine Learning (MCML); (3) Adobe Research. Correspondence to: Ali Modarressi <EMAIL>.
Pseudocode | No | The paper describes the "Haystack filtering pipeline for undesired or misleading content" with a diagram and accompanying textual explanation in Section 3.1. It also details the "Needle Set Design & Considerations" in Section A. However, it does not provide any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa."
Open Datasets | Yes | "We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa."
Dataset Splits | No | The paper describes the generation of haystacks and needle placements, resulting in "7,540 tests per context length experiment". It also mentions that "Evaluations at context lengths of 250, 500, and 1K are used to compute the base score." This describes the evaluation setup for pre-trained models rather than providing traditional train/validation/test splits for model training.
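The evaluation setup quoted above can be sketched as plain data plus a small aggregation helper. Note the aggregation used here (the mean over the three short-context scores) is an assumption for illustration; the quoted excerpt does not specify how the base score is computed.

```python
# Illustrative sketch of the evaluation setup: each context length has 7,540
# tests, and short-context runs (250, 500, and 1K tokens) feed into a "base
# score". The mean aggregation below is an assumption, not from the paper.

from statistics import mean

SHORT_CONTEXT_LENGTHS = [250, 500, 1000]  # lengths used for the base score
TESTS_PER_CONTEXT_LENGTH = 7_540

def base_score(scores_by_length: dict[int, float]) -> float:
    """Aggregate short-context scores into a single base score (mean here)."""
    return mean(scores_by_length[n] for n in SHORT_CONTEXT_LENGTHS)

# Example with made-up scores:
scores = {250: 98.0, 500: 96.0, 1000: 94.0}
print(base_score(scores))  # 96.0
```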
Hardware Specification | No | The paper mentions that open-weight models were deployed using the vLLM library, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions that "open weights were deployed using the vLLM library (Kwon et al., 2023), with weights obtained from Hugging Face (Wolf et al., 2020)". However, it does not provide specific version numbers for the vLLM library or any other key software dependencies.
Experiment Setup | Yes | "During inference, we use a task template (see Appendix C) that instructs the model to answer the question based on the provided text. [...] For all standard instruction-tuned models, we use greedy decoding during generation. For reasoning-based models, we utilize the default sampling decoding mechanism for GPT-o1 and GPT-o3 Mini, while R1-based models employ top-P sampling with p = 0.95 and a temperature of 0.6. In addition, we cap the maximum number of generated tokens in reasoning-based models at 1536 tokens, including both reasoning and output tokens."
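The decoding settings reported in this row can be collected as plain data for reference. The dictionary layout and key names below are illustrative assumptions, not from the paper's evaluation code; only the numeric values (p = 0.95, temperature 0.6, 1536-token cap) come from the quoted excerpt.

```python
# Decoding settings reported in the experiment-setup excerpt, collected as
# plain data. Layout and key names are illustrative, not from the paper.

DECODING_CONFIGS = {
    # Standard instruction-tuned models: deterministic greedy decoding.
    "instruction_tuned": {"strategy": "greedy", "temperature": 0.0},
    # GPT-o1 / GPT-o3 Mini: provider-default sampling; reasoning + output
    # tokens capped at 1536.
    "gpt_reasoning": {"strategy": "default_sampling", "max_tokens": 1536},
    # R1-based models: top-P sampling with p = 0.95, temperature 0.6,
    # same 1536-token cap.
    "r1_based": {
        "strategy": "top_p",
        "top_p": 0.95,
        "temperature": 0.6,
        "max_tokens": 1536,
    },
}

print(DECODING_CONFIGS["r1_based"]["top_p"])  # 0.95
```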