Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
HELMET: How to Evaluate Long-context Models Effectively and Thoroughly
Authors: Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the underlying reasons behind these practices and find that existing benchmarks often provide noisy signals due to limited coverage of applications, insufficient context lengths, unreliable metrics, and incompatibility with base models. In this work, we introduce HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address several issues in previous benchmarks by adding controllable lengths up to 128K tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 59 LCLMs, we find that (1) synthetic tasks like NIAH do not reliably predict downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlations with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning or following complex instructions, and the gap widens as length increases. |
| Researcher Affiliation | Collaboration | Howard Yen^p, Tianyu Gao^p, Minmin Hou^i, Ke Ding^i, Daniel Fleischer^i, Peter Izsak^i, Moshe Wasserblat^i, Danqi Chen^p (^p Princeton Language and Intelligence, Princeton University; ^i Intel) |
| Pseudocode | No | The paper describes the methodology in natural language and refers to implementations, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our data and code are available at https://github.com/princeton-nlp/HELMET. |
| Open Datasets | Yes | We utilize Natural Questions (NQ; Kwiatkowski et al., 2019), Trivia QA (TQA; Joshi et al., 2017), Hotpot QA (Yang et al., 2018), and Pop QA (Mallen et al., 2023). We leverage ALCE (Gao et al., 2023)... We use the MS MARCO dataset (Bajaj et al., 2018)... TREC-coarse, TREC-fine (Li & Roth, 2002), BANKING77 (Casanueva et al., 2020), CLINC150 (Larson et al., 2019), and NLU (Liu et al., 2019). We use Narrative QA (Kočiský et al., 2018) and the English book QA and multiple choice (MC) subsets from ∞BENCH (Zhang et al., 2024b)... Multi-Lex Sum (legal document summarization) and the English summarization task from ∞BENCH (novel summarization)... synthetic recall tasks from RULER (an extended version of NIAH; Hsieh et al., 2024) and also add a JSON KV retrieval task (Liu et al., 2023) |
| Dataset Splits | Yes | We randomly sample 100 to 600 examples from each dataset; more details are in D. ... We report accuracy on the test set. ... We balance the label distribution among the evaluation set. |
| Hardware Specification | Yes | For all open-source models, we evaluate on H100 GPUs with 80GB of memory. |
| Software Dependencies | No | We use the Hugging Face framework (Wolf et al., 2020) to load and generate model outputs. We apply instruction-tuned models' chat templates whenever applicable. We use FlashAttention-2 (Dao, 2023) and BF16 for faster inference. The paper mentions software tools like the Hugging Face framework and FlashAttention-2 but does not specify their version numbers. |
| Experiment Setup | Yes | We evaluate each model at input lengths L ∈ {8K, 16K, 32K, 64K, 128K}, where L is the number of Llama-2 tokens (Touvron et al., 2023), and use greedy decoding for all models to ensure consistency. ... For all model-based evaluations, we use GPT-4o-2024-05-13 as the judge. |
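The Dataset Splits row quotes "We balance the label distribution among the evaluation set." The paper does not spell out the balancing procedure, so the sketch below is a hypothetical, stdlib-only illustration of one common approach: group examples by label, shuffle each group with a fixed seed, and take an equal number per label. The function name `balanced_sample` and the dictionary schema are assumptions, not taken from the HELMET codebase.

```python
import random
from collections import defaultdict

def balanced_sample(examples, n, seed=0):
    """Sample up to n examples with an (approximately) uniform label
    distribution. Hypothetical sketch -- not the paper's exact procedure.

    examples: list of dicts, each with a "label" key.
    n: target total sample size.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    for group in by_label.values():
        rng.shuffle(group)
    # Take an equal quota from each label group.
    per_label = n // len(by_label)
    sample = []
    for group in by_label.values():
        sample.extend(group[:per_label])
    return sample
```

With this integer-quota design, labels whose groups are smaller than the quota simply contribute fewer examples, so the result is balanced only up to the availability of each class.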