NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities
Authors: Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, Kai Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that, while recent reasoning models such as DeepSeek-R1 and OpenAI's o3 have demonstrated strong performance on mathematical reasoning benchmarks, they still struggle to generalize their reasoning abilities and perform poorly on our information-dense tasks, frequently encountering difficulties with continuous retrieval and reasoning even at relatively shorter context lengths. Furthermore, we identify and characterize a phenomenon termed "under-thinking", wherein models prematurely conclude their reasoning processes despite the availability of relevant information. |
| Researcher Affiliation | Academia | Mo Li (Tsinghua University; Shanghai AI Laboratory); Songyang Zhang (Shanghai AI Laboratory); Taolin Zhang (Tsinghua University; Shanghai AI Laboratory); Haodong Duan (Shanghai AI Laboratory); Yunxin Liu (Tsinghua University); Kai Chen (Shanghai AI Laboratory) |
| Pseudocode | Yes | Algorithm 1 ATC Data Generation Algorithm |
| Open Source Code | Yes | All code and resources are publicly available at OpenCompass. |
| Open Datasets | Yes | The haystack for English tasks is built by extending the prompt with passages from the Paul Graham Essays dataset (Kamradt, 2023), and for Chinese tasks, we use the Chinese Domain Modeling Eval dataset (Wei et al., 2023b) to ensure linguistic diversity and high-quality filler content. ... For retrieval tasks, these may be unique fabricated facts (e.g., "Hidden on Emerald Island is the legendary Stardust Shard"), while for reasoning tasks, they are synthetic kinship needles, which are the same as those used in the information-dense task; see Sec. 3.2 for details. ... Our findings reveal that, despite recent top-performing models such as OpenAI's o3 (OpenAI, 2025b) and DeepSeek-R1 (DeepSeek-AI, 2025) achieving impressive results on mathematical benchmarks such as AIME (Di Zhang, 2025) and MATH500 (Hendrycks et al., 2021) |
| Dataset Splits | No | The paper describes how the NeedleBench tasks are constructed and evaluated at different context lengths and needle depths, with repetitions (R=10) for result stability. However, it does not provide train/test/validation splits for any underlying dataset such as Paul Graham Essays or Chinese Domain Modeling Eval; these datasets are used only as sources for constructing prompts and filler content. |
| Hardware Specification | No | The paper states: "We used LMDeploy (Contributors, 2023) and vLLM (Kwon et al., 2023) to accelerate the inference process." This refers to software tools used for inference acceleration, not specific hardware components such as GPU or CPU models, or memory details. No other specific hardware is mentioned. |
| Software Dependencies | No | The paper mentions using "LMDeploy (Contributors, 2023) and vLLM (Kwon et al., 2023) to accelerate the inference process." While specific software names are given, their version numbers are not provided in the text. |
| Experiment Setup | Yes | We evaluate mainstream open-source LLMs on the information-sparse tasks in NeedleBench at two representative context lengths: 32K and 128K tokens. Each model is tested at the maximum context length it officially supports. ... Unless otherwise specified, we use greedy decoding with temperature set to 0 for all model outputs. ... Token lengths are measured using the GPT-4 tokenizer. ... To mitigate the risk of instruction truncation... we subtract a buffer from the target context length when generating each input. ... For each configuration, we repeat the test R = 10 times to enhance result stability. |
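The prompt-construction procedure quoted in the Experiment Setup row (fill filler passages up to a target token length minus a truncation buffer, then place a needle at a given depth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: `count_tokens` is a whitespace stand-in for the GPT-4 tokenizer (e.g. tiktoken) used in the paper, and the buffer value of 200 tokens is hypothetical.

```python
def count_tokens(text: str) -> int:
    # Stand-in for the GPT-4 tokenizer used in the paper; a real
    # implementation would use tiktoken. Here we approximate token
    # count by whitespace splitting.
    return len(text.split())

def build_context(filler_passages, needle, target_length, buffer=200):
    """Accumulate haystack passages up to (target_length - buffer) tokens,
    leaving headroom so task instructions are not truncated, then insert
    the needle at a chosen depth within the filler."""
    budget = target_length - buffer
    chosen, used = [], 0
    for passage in filler_passages:
        n = count_tokens(passage)
        if used + n > budget:
            break  # adding this passage would exceed the token budget
        chosen.append(passage)
        used += n
    # Place the needle halfway through the filler (depth = 50%);
    # the benchmark sweeps this depth across positions.
    chosen.insert(len(chosen) // 2, needle)
    return "\n\n".join(chosen)

# Example: 20 filler passages of 100 tokens each, 500-token target.
passages = [("tok " * 100).strip()] * 20
ctx = build_context(passages, "The secret needle fact.", target_length=500, buffer=100)
```

With a 400-token budget, four passages fit, so the resulting context stays under the 500-token target with room reserved for the instructions.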