NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
Authors: Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses. Notably, the models struggle with multi-hop reasoning, detail-oriented questions, and handling extremely long inputs, with average lengths exceeding 200,000 tokens. Results highlight the need for substantial advancements in LLMs to enhance their long-context comprehension and contribute effectively to computational literary analysis. |
| Researcher Affiliation | Collaboration | Cunxiang Wang (1), Ruoxi Ning (1,2), Boqi Pan (3), Tonghui Wu (3), Qipeng Guo (4), Cheng Deng (5), Guangsheng Bao (1), Xiangkun Hu (5), Zheng Zhang (6), Qian Wang (3), and Yue Zhang (1). (1) Westlake University; (2) University of Waterloo; (3) Hangzhou Normal University; (4) Shanghai AI Lab; (5) SJTU; (6) NYU Shanghai |
| Pseudocode | No | The paper describes the methodology for creating the Novel QA benchmark and evaluating LLMs on it through narrative text and tables. It does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have released the demonstrations and input of NovelQA, and created a leaderboard. More details can be found at https://novelqa.github.io/. NovelQA is released under the Apache-2.0 License. For public access, we have released all constructed data on Huggingface (https://huggingface.co/datasets/NovelQA/NovelQA) and an evaluation system on Codabench (https://www.codabench.org/competitions/2727/). |
| Open Datasets | Yes | For public access, we have released all constructed data on Huggingface (https://huggingface.co/datasets/NovelQA/NovelQA) and an evaluation system on Codabench (https://www.codabench.org/competitions/2727/). |
| Dataset Splits | No | The paper mentions that golden answers for the test set will not be released to prevent data leakage, implying the existence of a test set. However, it does not provide specific details on the splitting methodology, such as percentages, sample counts for train/validation/test sets, or how these splits are defined or accessed for reproducibility. |
| Hardware Specification | Yes | Running long-context LLMs on extremely long inputs, such as 200K tokens, is challenging due to the immense GPU memory required: it takes roughly 2.5T of memory to compute one attention matrix for a 7B model with a 200K-token input, while our local device is 4×80G A100s. |
| Software Dependencies | No | To address this, we utilize LMDeploy (Contributors, 2023) (based on Dynamic NTK (emozilla, 2023)) and vLLM (Kwon et al., 2023) for memory and time reduction, which are only compatible with a limited set of LLMs. |
| Experiment Setup | Yes | We set temperature = 0 to eliminate randomness and keep other hyper-parameters default. |
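The ~2.5T figure in the Hardware Specification row can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes fp16 scores (2 bytes per element) and 32 attention heads per layer (a typical count for 7B models, not stated in the paper), with the full seq_len × seq_len score matrix materialized for every head of one layer.

```python
def attention_matrix_bytes(seq_len: int, num_heads: int = 32,
                           bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold the full (seq_len x seq_len) attention-score
    matrix for every head of a single layer, with no memory-saving tricks."""
    return seq_len * seq_len * num_heads * bytes_per_elem

# 200K-token input, as in the NovelQA hardware discussion.
mem = attention_matrix_bytes(200_000)
print(f"{mem / 1024**4:.2f} TiB")  # on the order of the cited ~2.5T
```

This is far beyond the 320GB available on 4×80G A100s, which is why memory-efficient inference engines such as LMDeploy and vLLM (which avoid materializing the full score matrix) are needed.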