reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Authors: Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through controlled experiments, we find that the model's effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning (SFT). ... We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark...
Researcher Affiliation	Collaboration	Yushi Bai¹, Jiajie Zhang¹, Xin Lv², Linzhi Zheng¹, Siqi Zhu¹, Lei Hou¹, Yuxiao Dong¹, Jie Tang¹, Juanzi Li¹; ¹Tsinghua University, ²Zhipu AI
Pseudocode	No	The paper describes a pipeline called AgentWrite with
Open Source Code	Yes	Our code & models are at: https://github.com/THUDM/LongWriter.
Open Datasets	No	The paper mentions creating the LongWriter-6k and LongBench-Write datasets but does not provide a direct link, DOI, or explicit statement of their public availability. The provided GitHub link is for 'code & models', not explicitly for 'data' or 'datasets'.
Dataset Splits	No	The paper describes how training sets were filtered by output length for the controlled experiments and how the LongBench-Write benchmark is divided into subsets by word count for evaluation. It also details the construction of the DPO data. However, it does not provide explicit train/validation/test splits (e.g., percentages or sample counts) for a single dataset used in the main experiments, nor a clear split for the LongWriter-6k dataset itself.
Hardware Specification	Yes	All models are trained using a node with 8x H800 80G GPUs and DeepSpeed+ZeRO3+CPU offloading (Rasley et al., 2020).
Software Dependencies	No	The paper mentions 'DeepSpeed+ZeRO3+CPU offloading (Rasley et al., 2020)' as part of the training setup. However, it does not specify version numbers for DeepSpeed or any other software dependency.
Experiment Setup	Yes	We use a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. We train the models for 4 epochs, which takes approximately 2,500-3,000 steps.
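
The hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be collected into a small configuration sketch. This is illustrative only and not the authors' training script: the class and function names are hypothetical, the batch size of 8 is assumed to be the global batch size (one packed sequence per GPU), "32k" is assumed to mean 32,768 tokens, and the choice to offload optimizer state (rather than parameters) to the CPU is a guess, since the paper says only "CPU offloading". The DeepSpeed keys used are standard configuration keys, but the paper does not publish its configuration file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; the field names are hypothetical."""
    batch_size: int = 8               # assumed to be the global batch size
    learning_rate: float = 1e-5
    packing_length: int = 32 * 1024   # "32k", assumed to be 32,768 tokens
    epochs: int = 4
    num_gpus: int = 8                 # one node with 8x H800 80G


def implied_sequences_per_epoch(setup: ReportedSetup, total_steps: int) -> float:
    """Packed sequences per epoch implied by a total optimizer step count."""
    steps_per_epoch = total_steps / setup.epochs
    return steps_per_epoch * setup.batch_size


def deepspeed_config(setup: ReportedSetup) -> dict:
    """A minimal ZeRO stage 3 config with CPU offloading (assumed layout)."""
    micro_batch = setup.batch_size // setup.num_gpus
    return {
        "train_batch_size": setup.batch_size,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": 1,
        "zero_optimization": {
            "stage": 3,
            # Assumption: the paper does not say what is offloaded.
            "offload_optimizer": {"device": "cpu"},
        },
    }


setup = ReportedSetup()
print(implied_sequences_per_epoch(setup, 2500))  # 5000.0
print(implied_sequences_per_epoch(setup, 3000))  # 6000.0
```

Under these assumptions, the reported 2,500-3,000 steps over 4 epochs would correspond to roughly 5,000-6,000 packed 32k-token sequences per epoch. If the batch size of 8 is instead per GPU, the implied count is eight times larger.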