NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Authors: Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses. Notably, the models struggle with multi-hop reasoning, detail-oriented questions, and handling extremely long inputs, with average lengths exceeding 200,000 tokens. Results highlight the need for substantial advancements in LLMs to enhance their long-context comprehension and contribute effectively to computational literary analysis.
Researcher Affiliation | Collaboration | Cunxiang Wang1, Ruoxi Ning1,2, Boqi Pan3, Tonghui Wu3, Qipeng Guo4, Cheng Deng5, Guangsheng Bao1, Xiangkun Hu5, Zheng Zhang6, Qian Wang3, and Yue Zhang1 — 1Westlake University; 2University of Waterloo; 3Hangzhou Normal University; 4Shanghai AI Lab; 5SJTU; 6NYU Shanghai
Pseudocode | No | The paper describes the methodology for creating the NovelQA benchmark and evaluating LLMs on it through narrative text and tables. It does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We have released the demonstrations and input of NovelQA and created a leaderboard; more details can be found at https://novelqa.github.io/. NovelQA is released under the Apache-2.0 License. For public access, we have released all constructed data on Huggingface (https://huggingface.co/datasets/NovelQA/NovelQA) and an evaluation system on Codabench (https://www.codabench.org/competitions/2727/).
Open Datasets | Yes | For public access, we have released all constructed data on Huggingface (https://huggingface.co/datasets/NovelQA/NovelQA) and an evaluation system on Codabench (https://www.codabench.org/competitions/2727/).
Dataset Splits | No | The paper mentions that golden answers for the test set will not be released to prevent data leakage, implying the existence of a test set. However, it does not provide specific details on the splitting methodology, such as percentages, sample counts for train/validation/test sets, or how these splits are defined or accessed for reproducibility.
Hardware Specification | Yes | Running long-context LLMs on extremely long inputs, such as 200K tokens, is challenging due to the immense GPU memory required: for example, it takes roughly 2.5 TB of memory to compute one attention matrix for a 7B model with a 200K-token input, while our local devices are 4×80GB A100 GPUs.
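The quoted 2.5 TB figure can be sanity-checked with simple arithmetic. The sketch below assumes fp16 attention scores (2 bytes per entry) and 32 attention heads per layer, a typical configuration for a 7B model; these assumptions are illustrative and not stated in the source.

```python
# Back-of-envelope estimate: memory to materialize the full attention score
# matrix (seq_len x seq_len, per head) for ONE transformer layer.
# Assumptions (not from the source): fp16 entries (2 bytes), 32 heads.

def attention_matrix_bytes(seq_len: int, num_heads: int = 32, bytes_per_entry: int = 2) -> int:
    """Bytes needed to store one layer's attention score matrices across all heads."""
    return seq_len * seq_len * num_heads * bytes_per_entry

tb = attention_matrix_bytes(200_000) / 1e12  # decimal terabytes
print(f"{tb:.2f} TB")  # -> 2.56 TB, consistent with the paper's "roughly 2.5T"
```

This is why exact full attention at 200K tokens is infeasible on a 4×80GB (320 GB total) node, motivating the memory-reduction tooling mentioned in the Software Dependencies row.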
Software Dependencies | No | To address this, we utilize LMDeploy (Contributors, 2023) (based on Dynamic NTK (emozilla, 2023)) and vLLM (Kwon et al., 2023) for memory and time reduction, which are only compatible with certain LLMs.
Experiment Setup | Yes | We set temperature = 0 to eliminate randomness and keep all other hyper-parameters at their defaults.
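Why temperature = 0 eliminates randomness can be seen from temperature-scaled softmax: as the temperature shrinks, the sampling distribution collapses onto the highest-scoring token, so decoding becomes deterministic (greedy). The sketch below is generic softmax math, not the paper's code; the zero-temperature case is treated as the argmax limit, as most inference stacks do.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over raw logits.

    Lower temperature sharpens the distribution; temperature == 0 is
    treated as the greedy (argmax) limit, i.e. a one-hot distribution.
    """
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # spread-out distribution
print(softmax_with_temperature(logits, 0.1))  # nearly one-hot
print(softmax_with_temperature(logits, 0.0))  # exactly one-hot: deterministic
```

With temperature = 0 every run picks the same (highest-logit) token at each step, which is what makes the benchmark results repeatable across evaluation runs.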