AcademicEval: Live Long-Context LLM Benchmark

Authors: Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities.
Researcher Affiliation | Academia | Haozhen Zhang (EMAIL, EMAIL), University of Illinois at Urbana-Champaign; Tao Feng (EMAIL), University of Illinois at Urbana-Champaign; Pengrui Han (EMAIL), University of Illinois at Urbana-Champaign; Jiaxuan You (EMAIL), University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes procedures and methodologies in natural-language text and flowcharts (Figure 2) but does not present any structured pseudocode or algorithm blocks with numbered steps formatted like code.
Open Source Code | Yes | Code is available at https://github.com/ulab-uiuc/AcademicEval.
Open Datasets | Yes | AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling.
Dataset Splits | Yes | Table 2: Data Statistics of AcademicEval (Initial Round). The benchmark includes 4 writing tasks and provides four settings of different context length for each task. For each setting, the table lists Comp. Rate, Samples of Each, Chronological Split, and Timespan of Test Data. Chronological splits: Title Writing ... 72%-19%-9%; Abstract Writing ... 72%-19%-9%; Introduction Writing ... 71%-20%-9%; Related Work Writing ... 72%-20%-8%.
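The chronological splits reported above can be illustrated with a small sketch. This is a hedged approximation of the idea, not the benchmark's actual code: `chronological_split` and the dummy `(date, paper_id)` records are hypothetical, while the 72/19/9 ratios follow Table 2.

```python
def chronological_split(records, ratios=(0.72, 0.19, 0.09)):
    """Sort records by date, then cut into train/val/test so the
    newest papers land in the test split (a chronological split)."""
    ordered = sorted(records, key=lambda r: r[0])  # r = (date, paper_id)
    n = len(ordered)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

# Hypothetical usage with dummy monthly papers:
papers = [(f"2024-{m:02d}", f"id{m}") for m in range(1, 13)]
train, val, test = chronological_split(papers)
```

A chronological (rather than random) split keeps the test papers strictly newer than the training papers, which matters for a "live" benchmark that guards against pretraining contamination.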
Hardware Specification | No | The paper mentions using the "LLM API provided by together.ai" but does not specify the underlying hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper mentions several software components like "PyMuPDF", "LangChain", "BERT tokenizer", "deberta-xlarge-mnli", and "ROUGE-L" but does not provide specific version numbers for these dependencies to ensure reproducibility.
Experiment Setup | Yes | API Access. In this paper, we conduct a comprehensive evaluation over the AcademicEval benchmark using the LLM API provided by together.ai. For each API call, we fix the temperature parameter to 0 (i.e., greedy decoding). Details of the Implementation of RALM. We use the inputs of AcademicEval as the external corpus of RALM (such as Target Content and Reference Content introduced in Section D). For text splitting, we use the RecursiveCharacterTextSplitter from LangChain and set chunk size and chunk overlap to 512 and 64, respectively. For each retrieval, we recall up to 12 text chunks (limited by the context length of standard LLMs) based on text similarity (semantic similarity based on inner product for dense retrievers, or similarity based on word frequency for sparse retrievers).
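The chunking and retrieval-budget parameters described above can be sketched in a few lines of plain Python. This is only a fixed-window approximation under the stated settings (chunk size 512, overlap 64, at most 12 chunks recalled), not LangChain's actual RecursiveCharacterTextSplitter, which splits on a hierarchy of separators; `corpus` is a placeholder document.

```python
def split_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into overlapping fixed-size character chunks
    (fixed-window stand-in for the paper's chunking settings)."""
    step = chunk_size - chunk_overlap  # advance 448 new characters per chunk
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

corpus = "x" * 2000            # placeholder for Target/Reference Content
chunks = split_text(corpus)
top_k = chunks[:12]            # each retrieval recalls at most 12 chunks
```

The 64-character overlap means consecutive chunks share a boundary region, so a sentence cut at a chunk edge still appears intact in one of its neighbors; the 12-chunk cap keeps the retrieved context within a standard LLM's context window.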