Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?

Authors: Jonathan Roberts, Kai Han, Samuel Albanie

ICLR 2025

Reproducibility Variable | Result | Supporting Evidence
Research Type | Experimental | "To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window."
Researcher Affiliation | Academia | Jonathan Roberts, Kai Han, Samuel Albanie (University of Cambridge; The University of Hong Kong). Emails: EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes several tasks (Single Needle, Multiple Needles, Conditional Needles, Threading, Multi-Threading, Branched Threading) and illustrates their structure using string-serialized JSON objects and schematics (Figure 2). However, it provides no explicitly labeled pseudocode or algorithm block for a method proposed by the authors.
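The haystack construction described above can be sketched concretely. The following is a minimal illustrative sketch, not the authors' code: it assumes UUID-style keys and values (the paper only states that haystacks are string-serialized JSON objects), and builds a Single Needle instance by picking one key-value pair to retrieve.

```python
import json
import random
import uuid


def build_haystack(n_pairs: int, seed: int = 0):
    """Build a string-serialized JSON haystack of random key-value pairs
    and designate one pair as the 'needle' to be retrieved.

    Returns (haystack_string, question, expected_answer).
    """
    rng = random.Random(seed)
    # UUID-style keys/values are an assumption for illustration only.
    pairs = {
        str(uuid.UUID(int=rng.getrandbits(128))): str(uuid.UUID(int=rng.getrandbits(128)))
        for _ in range(n_pairs)
    }
    needle_key = rng.choice(list(pairs))
    haystack = json.dumps(pairs)
    question = f"What is the value associated with key {needle_key}?"
    return haystack, question, pairs[needle_key]
```

Varying `n_pairs` (and hence the serialized length) gives the fine-grained control over context size that the paper attributes to synthetic data.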
Open Source Code | Yes | "We release our code and long context experimental data." ... "We release our code and tasks for the community to use and we hope that our findings encourage further long context understanding research."
Open Datasets | Yes | "We release our code and long context experimental data." ... "Taking inspiration from prior works (Liu et al., 2024; Hsieh et al., 2024a; Zhang et al., 2024), we focus our experimentation on abstract tasks containing synthetically generated data. By using synthetic data, (1) we avoid potentially expensive question-and-answer curation and annotation, (2) we ensure high-quality and noise-free data, and (3) we gain fine-grained control over the sequence length and other task parameters, allowing direct influence on difficulty."
Dataset Splits | No | "For most models, we repeat each experiment on 5 different sets of haystacks and report the average performance; however, in some cases, only 1 repeat was feasible due to rate-limit restrictions." ... "Experiments were carried out on haystacks of 12 different sizes ranging from 1k to 630k tokens (measured in LLaMA 3.1 tokens)." The paper describes how synthetic input "haystacks" are generated and varied for evaluation (e.g., varying context length, number of needles, and thread length) and that experiments are repeated, but it does not specify traditional training, validation, or test splits for model training.
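The repeat-and-average protocol quoted above is straightforward to express. A minimal sketch, where `run_fn` is a hypothetical callable standing in for one retrieval experiment (it takes a haystack size and a seed, and returns an accuracy):

```python
from statistics import mean


def sweep(run_fn, sizes, n_repeats=5):
    """For each haystack size, average accuracy over `n_repeats`
    independently seeded haystacks, mirroring the paper's protocol
    of 5 repeats per setting where rate limits allow.
    """
    return {
        size: mean(run_fn(size, seed) for seed in range(n_repeats))
        for size in sizes
    }
```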
Hardware Specification | No | "We evaluate the LLMs via the Vertex AI (Google, 2024) {Gemini, Claude, Jamba, LLaMA 3.1, and Mistral}, OpenAI (OpenAI, 2024a) {GPT}, and Reka (AI, 2024b) {Reka} APIs." The authors ran all inference through commercial API services, so no dedicated hardware is specified. The paper's discussion of "API-based restrictions and limitations", including "Cost", "Context restrictions", and "Latency", further supports the use of cloud-based APIs rather than local hardware.
Software Dependencies | No | "We evaluate the LLMs via the Vertex AI (Google, 2024) {Gemini, Claude, Jamba, LLaMA 3.1, and Mistral}, OpenAI (OpenAI, 2024a) {GPT}, and Reka (AI, 2024b) {Reka} APIs." ... Closed-source model API versions: GPT-4o mini: gpt-4o-mini-2024-07-18; GPT-4o: gpt-4o-2024-08-06; Gemini-Pro: gemini-1.0-pro-002; Gemini 1.5 Flash: gemini-1.5-flash-preview-0514; Gemini 1.5 Pro: gemini-1.5-pro-preview-0514; Claude 3 Haiku: claude-3-haiku@20240307; Claude 3 Sonnet: claude-3-sonnet@20240229; Claude 3.5 Sonnet: claude-3-5-sonnet@20240620; Reka Flash: reka-flash-20240904; Reka Core: reka-core-20240415. The paper lists the commercial LLM APIs and model versions used for evaluation; these are external services rather than locally installed software dependencies with version numbers (e.g., Python libraries or frameworks).
Experiment Setup | Yes | "Prompting. We used a simple prompting strategy throughout our experimentation that consisted of a single basic user prompt containing the question and output format instructions for each task. In keeping with prior works (Roberts et al., 2024a;b; OpenAI, 2024b), we do not modify the system prompt or tailor the prompt for each model. With the exception of providing examples of the desired output format, we do not use few-shot examples or explicitly encourage reasoning." ... "Inference. All inference was carried out in a zero-shot setting. To aid reproducibility, we set model hyperparameters that encourage generation that is as deterministic as possible. Concretely, we use greedy search decoding, in which the most probable token is selected from the model vocabulary V at each step, conditional on the preceding tokens, i.e., w_{n+1} = argmax_{w ∈ V} P(w | w_1, w_2, ..., w_n). We achieve this by specifying random seeds and setting the temperature parameter to zero."
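The greedy decoding rule quoted above (select the argmax token at each step) can be shown in isolation. This is an illustrative sketch, not the authors' implementation: `logprob_fn` is a hypothetical scoring function standing in for the LLM, which the paper approximates through the APIs by setting temperature to zero and fixing random seeds.

```python
def greedy_decode(logprob_fn, vocab, prefix, max_new_tokens):
    """Greedy (argmax) decoding: at each step, append the token w that
    maximizes logprob_fn(tokens_so_far, w) over the vocabulary, i.e.
    w_{n+1} = argmax_{w in V} P(w | w_1, ..., w_n).
    """
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        tokens.append(max(vocab, key=lambda w: logprob_fn(tokens, w)))
    return tokens
```

With a toy bigram model as `logprob_fn`, the decoder deterministically follows the highest-probability chain, which is the behavior the authors rely on for reproducibility.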