LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Authors: Yuhao Wu, Ming Shan Hee, Zhiqiang Hu, Roy Ka-Wei Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform extensive experiments on both open-source and closed-source models, revealing that despite their advanced capabilities, most models struggle significantly with super-long-form generation tasks, particularly in maintaining instruction adherence and coherence over long outputs."
Researcher Affiliation | Academia | Singapore University of Technology and Design
Pseudocode | Yes | Algorithm 1: Evaluation Pipeline
Open Source Code | Yes | "We open-source LongGenBench to promote comprehensive evaluation and improvement in this critical area, with code and data available at https://github.com/mozhu621/LongGenBench."
Open Datasets | Yes | "We introduce LongGenBench, a comprehensive dataset that provides a diverse set of tasks specifically designed to evaluate the super-long-form generation capabilities of LLMs across varying token lengths (16K and 32K) and levels of text complexity."
Dataset Splits | Yes | "For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens."
Hardware Specification | Yes | "Inferences were performed using BFloat16 precision on 8 NVIDIA A800 GPUs, employing greedy decoding to generate the outputs."
Software Dependencies | No | The paper mentions using the vLLM system (Kwon et al., 2023) and Huggingface (Wolf et al., 2019), but does not specify version numbers for these or other software dependencies in its experimental setup.
Experiment Setup | Yes | "Task Configurations. For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens. The generation was based on designated templates for each model, ensuring task-specific relevance. ... To ensure the relevance of the generated content and prevent off-topic responses or refusals to answer, we prefixed each task input with a carefully curated answer prompt designed to guide the model's output."
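The experiment setup describes prefixing each task input with a curated answer prompt to keep generation on-topic, evaluated at two target lengths (16K and 32K tokens). A minimal sketch of that prompt-assembly step is below; the template wording, function name, and example task are illustrative assumptions, not text taken from the paper.

```python
# Sketch of the answer-prompt prefixing step described in the paper's
# experiment setup. The paper then runs these prompts through vLLM with
# greedy decoding; this snippet only shows the prompt construction.

def build_prompt(task_input: str, target_tokens: int) -> str:
    """Prefix a task input with a guiding answer prompt (wording is illustrative)."""
    answer_prompt = (
        f"Please write a coherent response of roughly {target_tokens} tokens. "
        "Follow every instruction in the task below and do not refuse.\n\n"
    )
    return answer_prompt + task_input

# Two target lengths per scenario, as reported in the benchmark.
prompts = [
    build_prompt("Task: write a week-by-week diary covering one year.", n)
    for n in (16_000, 32_000)
]
```

Pairing an explicit length target with a "do not refuse" instruction addresses the two failure modes the paper highlights: off-topic responses and outright refusals to produce very long outputs.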