NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities
Authors: Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, Kai Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that, while recent reasoning models such as DeepSeek-R1 and OpenAI's o3 have demonstrated strong performance on mathematical reasoning benchmarks, they still struggle to generalize their reasoning abilities and perform poorly on our information-dense tasks, frequently encountering difficulties with continuous retrieval and reasoning even at relatively shorter context lengths. Furthermore, we identify and characterize a phenomenon termed "under-thinking", wherein models prematurely conclude their reasoning processes despite the availability of relevant information. |
| Researcher Affiliation | Academia | Mo Li (Tsinghua University; Shanghai AI Laboratory); Songyang Zhang (Shanghai AI Laboratory); Taolin Zhang (Tsinghua University; Shanghai AI Laboratory); Haodong Duan (Shanghai AI Laboratory); Yunxin Liu (Tsinghua University); Kai Chen (Shanghai AI Laboratory) |
| Pseudocode | Yes | Algorithm 1 ATC Data Generation Algorithm |
| Open Source Code | Yes | All code and resources are publicly available at OpenCompass. |
| Open Datasets | Yes | The haystack for English tasks is built by extending the prompt with passages from the Paul Graham Essays dataset (Kamradt, 2023), and for Chinese tasks, we use the Chinese Domain Modeling Eval dataset (Wei et al., 2023b) to ensure linguistic diversity and high-quality filler content. ... For retrieval tasks, these may be unique fabricated facts (e.g., "Hidden on Emerald Island is the legendary Stardust Shard"), while for reasoning tasks, they are synthetic kinship needles, which are the same as those used in the information-dense task; see Sec. 3.2 for details. ... Our findings reveal that, despite recent top-performing models such as OpenAI's o3 (OpenAI, 2025b) and DeepSeek-R1 (DeepSeek-AI, 2025) achieving impressive results on mathematical benchmarks such as AIME (Di Zhang, 2025) and MATH500 (Hendrycks et al., 2021) |
| Dataset Splits | No | The paper describes how the NeedleBench tasks are constructed and evaluated at different context lengths and needle depths, with repetitions (R=10) for result stability. However, it does not provide train/test/validation splits for any underlying dataset such as Paul Graham Essays or Chinese Domain Modeling Eval; these datasets are used only as sources for constructing prompts and filler content. |
| Hardware Specification | No | The paper states: "We used LMDeploy (Contributors, 2023) and vLLM (Kwon et al., 2023) to accelerate the inference process." This refers to software tools used for inference acceleration, not specific hardware components such as GPU or CPU models, or memory details. No other specific hardware is mentioned. |
| Software Dependencies | No | The paper mentions using "LMDeploy (Contributors, 2023) and vLLM (Kwon et al., 2023) to accelerate the inference process." While specific software names are given, their version numbers are not provided in the text. |
| Experiment Setup | Yes | We evaluate mainstream open-source LLMs on the information-sparse tasks in NeedleBench at two representative context lengths: 32K and 128K tokens. Each model is tested at the maximum context length it officially supports. ... Unless otherwise specified, we use greedy decoding with temperature set to 0 for all model outputs. ... Token lengths are measured using the GPT-4 tokenizer. ... To mitigate the risk of instruction truncation... we subtract a buffer from the target context length when generating each input. ... For each configuration, we repeat the test R = 10 times to enhance result stability. |
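The prompt-construction procedure quoted in the Experiment Setup row (fill filler passages up to a target token length minus a truncation buffer, then place a needle at a given depth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: `count_tokens` is a whitespace stand-in for the GPT-4 tokenizer (e.g. tiktoken) used in the paper, and the buffer value of 200 tokens is hypothetical.

```python
def count_tokens(text: str) -> int:
    # Stand-in for the GPT-4 tokenizer used in the paper; a real
    # implementation would use tiktoken. Here we approximate token
    # count by whitespace splitting.
    return len(text.split())

def build_context(filler_passages, needle, target_length, buffer=200):
    """Accumulate haystack passages up to (target_length - buffer) tokens,
    leaving headroom so task instructions are not truncated, then insert
    the needle at a chosen depth within the filler."""
    budget = target_length - buffer
    chosen, used = [], 0
    for passage in filler_passages:
        n = count_tokens(passage)
        if used + n > budget:
            break  # adding this passage would exceed the token budget
        chosen.append(passage)
        used += n
    # Place the needle halfway through the filler (depth = 50%);
    # the benchmark sweeps this depth across positions.
    chosen.insert(len(chosen) // 2, needle)
    return "\n\n".join(chosen)

# Example: 20 filler passages of 100 tokens each, 500-token target.
passages = [("tok " * 100).strip()] * 20
ctx = build_context(passages, "The secret needle fact.", target_length=500, buffer=100)
```

With a 400-token budget, four passages fit, so the resulting context stays under the 500-token target with room reserved for the instructions.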