AcademicEval: Live Long-Context LLM Benchmark

Authors: Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities.
Researcher Affiliation | Academia | Haozhen Zhang (EMAIL, EMAIL), University of Illinois at Urbana-Champaign; Tao Feng (EMAIL), University of Illinois at Urbana-Champaign; Pengrui Han (EMAIL), University of Illinois at Urbana-Champaign; Jiaxuan You (EMAIL), University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes procedures and methodologies in natural-language text and flowcharts (Figure 2) but does not present any structured pseudocode or algorithm blocks with numbered steps formatted like code.
Open Source Code | Yes | Code is available at https://github.com/ulab-uiuc/AcademicEval.
Open Datasets | Yes | AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling.
Dataset Splits | Yes | Table 2: Data Statistics of AcademicEval (Initial Round). The benchmark includes 4 writing tasks and provides four settings of different context length for each task. For each setting, the table lists Comp. Rate, Samples of Each, Chronological Split, and Timespan of Test Data. Chronological splits: Title Writing ... 72%-19%-9%; Abstract Writing ... 72%-19%-9%; Introduction Writing ... 71%-20%-9%; Related Work Writing ... 72%-20%-8%.
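The chronological splits reported above can be illustrated with a small sketch. This is a hedged approximation of the idea, not the benchmark's actual code: `chronological_split` and the dummy `(date, paper_id)` records are hypothetical, while the 72/19/9 ratios follow Table 2.

```python
def chronological_split(records, ratios=(0.72, 0.19, 0.09)):
    """Sort records by date, then cut into train/val/test so the
    newest papers land in the test split (a chronological split)."""
    ordered = sorted(records, key=lambda r: r[0])  # r = (date, paper_id)
    n = len(ordered)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

# Hypothetical usage with dummy monthly papers:
papers = [(f"2024-{m:02d}", f"id{m}") for m in range(1, 13)]
train, val, test = chronological_split(papers)
```

A chronological (rather than random) split keeps the test papers strictly newer than the training papers, which matters for a "live" benchmark that guards against pretraining contamination.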
Hardware Specification | No | The paper mentions using the "LLM API provided by together.ai" but does not specify the underlying hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper mentions several software components like "PyMuPDF", "LangChain", "BERT tokenizer", "deberta-xlarge-mnli", and "ROUGE-L" but does not provide specific version numbers for these dependencies to ensure reproducibility.
Experiment Setup | Yes | API Access. In this paper, we conduct a comprehensive evaluation over the AcademicEval benchmark using the LLM API provided by together.ai. For each API call, we fix the temperature parameter to 0 (i.e., greedy decoding). Details of the Implementation of RALM. We use the inputs of AcademicEval as the external corpus of RALM (such as Target Content and Reference Content introduced in Section D). For text splitting, we use the RecursiveCharacterTextSplitter from LangChain and set chunk size and chunk overlap to 512 and 64, respectively. For each retrieval, we recall up to 12 text chunks (limited by the context length of standard LLMs) based on text similarity (semantic similarity based on inner product for dense retrievers, or similarity based on word frequency for sparse retrievers).
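The chunking and retrieval-budget parameters described above can be sketched in a few lines of plain Python. This is only a fixed-window approximation under the stated settings (chunk size 512, overlap 64, at most 12 chunks recalled), not LangChain's actual RecursiveCharacterTextSplitter, which splits on a hierarchy of separators; `corpus` is a placeholder document.

```python
def split_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into overlapping fixed-size character chunks
    (fixed-window stand-in for the paper's chunking settings)."""
    step = chunk_size - chunk_overlap  # advance 448 new characters per chunk
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

corpus = "x" * 2000            # placeholder for Target/Reference Content
chunks = split_text(corpus)
top_k = chunks[:12]            # each retrieval recalls at most 12 chunks
```

The 64-character overlap means consecutive chunks share a boundary region, so a sentence cut at a chunk edge still appears intact in one of its neighbors; the 12-chunk cap keeps the retrieved context within a standard LLM's context window.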