NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Authors: Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses. Notably, the models struggle with multi-hop reasoning, detail-oriented questions, and handling extremely long inputs, with average lengths exceeding 200,000 tokens. Results highlight the need for substantial advancements in LLMs to enhance their long-context comprehension and contribute effectively to computational literary analysis.
Researcher Affiliation | Collaboration | Cunxiang Wang1, Ruoxi Ning1,2, Boqi Pan3, Tonghui Wu3, Qipeng Guo4, Cheng Deng5, Guangsheng Bao1, Xiangkun Hu5, Zheng Zhang6, Qian Wang3, and Yue Zhang1 — 1Westlake University; 2University of Waterloo; 3Hangzhou Normal University; 4Shanghai AI Lab; 5SJTU; 6NYU Shanghai
Pseudocode | No | The paper describes the methodology for creating the NovelQA benchmark and evaluating LLMs on it through narrative text and tables. It does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We have released the demonstrations and input of NovelQA and created a leaderboard; more details can be found at https://novelqa.github.io/. NovelQA is released under the Apache-2.0 License. For public access, we have released all constructed data on Huggingface (https://huggingface.co/datasets/NovelQA/NovelQA) and an evaluation system on Codabench (https://www.codabench.org/competitions/2727/).
Open Datasets | Yes | For public access, we have released all constructed data on Huggingface (https://huggingface.co/datasets/NovelQA/NovelQA) and an evaluation system on Codabench (https://www.codabench.org/competitions/2727/).
Dataset Splits | No | The paper mentions that golden answers for the test set will not be released to prevent data leakage, implying the existence of a test set. However, it does not provide specific details on the splitting methodology, such as percentages, sample counts for train/validation/test sets, or how these splits are defined or accessed for reproducibility.
Hardware Specification | Yes | Running long-context LLMs on extremely long inputs, such as 200K tokens, is challenging due to the immense GPU memory required: for example, it takes roughly 2.5 TB of memory to compute one attention matrix for a 7B model with a 200K-token input, while our local devices are 4×80GB A100 GPUs.
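The quoted 2.5 TB figure can be sanity-checked with simple arithmetic. The sketch below assumes fp16 attention scores (2 bytes per entry) and 32 attention heads per layer, a typical configuration for a 7B model; these assumptions are illustrative and not stated in the source.

```python
# Back-of-envelope estimate: memory to materialize the full attention score
# matrix (seq_len x seq_len, per head) for ONE transformer layer.
# Assumptions (not from the source): fp16 entries (2 bytes), 32 heads.

def attention_matrix_bytes(seq_len: int, num_heads: int = 32, bytes_per_entry: int = 2) -> int:
    """Bytes needed to store one layer's attention score matrices across all heads."""
    return seq_len * seq_len * num_heads * bytes_per_entry

tb = attention_matrix_bytes(200_000) / 1e12  # decimal terabytes
print(f"{tb:.2f} TB")  # -> 2.56 TB, consistent with the paper's "roughly 2.5T"
```

This is why exact full attention at 200K tokens is infeasible on a 4×80GB (320 GB total) node, motivating the memory-reduction tooling mentioned in the Software Dependencies row.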
Software Dependencies | No | To address this, we utilize LMDeploy (Contributors, 2023) (based on Dynamic NTK (emozilla, 2023)) and vLLM (Kwon et al., 2023) for memory and time reduction, which are only compatible with certain LLMs.
Experiment Setup | Yes | We set temperature = 0 to eliminate randomness and keep all other hyper-parameters at their defaults.
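Why temperature = 0 eliminates randomness can be seen from temperature-scaled softmax: as the temperature shrinks, the sampling distribution collapses onto the highest-scoring token, so decoding becomes deterministic (greedy). The sketch below is generic softmax math, not the paper's code; the zero-temperature case is treated as the argmax limit, as most inference stacks do.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over raw logits.

    Lower temperature sharpens the distribution; temperature == 0 is
    treated as the greedy (argmax) limit, i.e. a one-hot distribution.
    """
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # spread-out distribution
print(softmax_with_temperature(logits, 0.1))  # nearly one-hot
print(softmax_with_temperature(logits, 0.0))  # exactly one-hot: deterministic
```

With temperature = 0 every run picks the same (highest-logit) token at each step, which is what makes the benchmark results repeatable across evaluation runs.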