LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Authors: Yuhao Wu, Ming Shan Hee, Zhiqiang Hu, Roy Ka-Wei Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform extensive experiments on both open-source and closed-source models, revealing that despite their advanced capabilities, most models struggle significantly with super-long-form generation tasks, particularly in maintaining instruction adherence and coherence over long outputs."
Researcher Affiliation | Academia | Singapore University of Technology and Design
Pseudocode | Yes | Algorithm 1: Evaluation Pipeline
Open Source Code | Yes | "We open-source LongGenBench to promote comprehensive evaluation and improvement in this critical area, with code and data available at https://github.com/mozhu621/LongGenBench."
Open Datasets | Yes | "We introduce LongGenBench, a comprehensive dataset that provides a diverse set of tasks specifically designed to evaluate the super-long-form generation capabilities of LLMs across varying token lengths (16K and 32K) and levels of text complexity."
Dataset Splits | Yes | "For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens."
Hardware Specification | Yes | "Inferences were performed using BFloat16 precision on 8 NVIDIA A800 GPUs, employing greedy decoding to generate the outputs."
Software Dependencies | No | The paper mentions using the vLLM system (Kwon et al., 2023) and Huggingface (Wolf et al., 2019), but does not specify version numbers for these or other software dependencies in its experimental setup.
Experiment Setup | Yes | "Task Configurations. For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens. The generation was based on designated templates for each model, ensuring task-specific relevance. ... To ensure the relevance of the generated content and prevent off-topic responses or refusals to answer, we prefixed each task input with a carefully curated answer prompt designed to guide the model's output."
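The experiment setup describes prefixing each task input with a curated answer prompt to keep generation on-topic, evaluated at two target lengths (16K and 32K tokens). A minimal sketch of that prompt-assembly step is below; the template wording, function name, and example task are illustrative assumptions, not text taken from the paper.

```python
# Sketch of the answer-prompt prefixing step described in the paper's
# experiment setup. The paper then runs these prompts through vLLM with
# greedy decoding; this snippet only shows the prompt construction.

def build_prompt(task_input: str, target_tokens: int) -> str:
    """Prefix a task input with a guiding answer prompt (wording is illustrative)."""
    answer_prompt = (
        f"Please write a coherent response of roughly {target_tokens} tokens. "
        "Follow every instruction in the task below and do not refuse.\n\n"
    )
    return answer_prompt + task_input

# Two target lengths per scenario, as reported in the benchmark.
prompts = [
    build_prompt("Task: write a week-by-week diary covering one year.", n)
    for n in (16_000, 32_000)
]
```

Pairing an explicit length target with a "do not refuse" instruction addresses the two failure modes the paper highlights: off-topic responses and outright refusals to produce very long outputs.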