LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Authors: Yuhao Wu, Ming Shan Hee, Zhiqiang Hu, Roy Ka-Wei Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments on both open-source and closed-source models, revealing that despite their advanced capabilities, most models struggle significantly with super-long-form generation tasks, particularly in maintaining instruction adherence and coherence over long outputs. |
| Researcher Affiliation | Academia | 1 Singapore University of Technology and Design |
| Pseudocode | Yes | Algorithm 1 Evaluations Pipeline |
| Open Source Code | Yes | We open-source LongGenBench to promote comprehensive evaluation and improvement in this critical area, with code and data available at https://github.com/mozhu621/LongGenBench. |
| Open Datasets | Yes | We introduce LongGenBench, a comprehensive dataset that provides a diverse set of tasks specifically designed to evaluate the super-long-form generation capabilities of LLMs across varying token lengths (16K and 32K) and levels of text complexity. |
| Dataset Splits | Yes | For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens. |
| Hardware Specification | Yes | Inferences were performed using BFloat16 precision on 8 NVIDIA A800 GPUs, employing greedy decoding to generate the outputs. |
| Software Dependencies | No | The paper mentions "We utilized the vLLM (Kwon et al., 2023) system" and "Huggingface (Wolf et al., 2019)", but does not specify version numbers for these or other software dependencies used in the experimental setup. |
| Experiment Setup | Yes | Task Configurations. For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens. The generation was based on designated templates for each model, ensuring task-specific relevance. ... To ensure the relevance of the generated content and prevent off-topic responses or refusals to answer, we prefixed each task input with a carefully curated answer prompt designed to guide the model's output. |
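The experiment-setup row describes prefixing each task input with a curated answer prompt so the model begins an on-topic long-form response rather than refusing. A minimal sketch of that prompt-assembly step is below; the function name, task text, and prefix are illustrative assumptions, not the benchmark's actual templates.

```python
# Sketch of the answer-prompt prefixing described in the setup row.
# `build_prompt`, the task text, and the prefix are hypothetical examples;
# LongGenBench's real templates are model-specific and live in its repo.

def build_prompt(task_instruction: str, answer_prefix: str) -> str:
    """Append a guiding answer prompt after the task input so the
    model's continuation starts inside an on-topic response."""
    return f"{task_instruction}\n\n{answer_prefix}"

task = "Write a diary with one entry per week for a full year."
prefix = "Sure, here is the diary, week by week:\nWeek 1:"
prompt = build_prompt(task, prefix)
# The model then continues from "Week 1:", which discourages refusals.
```

Under this pattern, greedy decoding (as used in the paper's inference setup) simply continues the text after the prefix, so adherence failures show up as drift from the templated structure rather than as outright refusals.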