LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Authors: Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through controlled experiments, we find that the model's effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning (SFT). ... We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark... |
| Researcher Affiliation | Collaboration | Yushi Bai¹, Jiajie Zhang¹, Xin Lv², Linzhi Zheng¹, Siqi Zhu¹, Lei Hou¹, Yuxiao Dong¹, Jie Tang¹, Juanzi Li¹; ¹Tsinghua University, ²Zhipu AI |
| Pseudocode | No | The paper describes a pipeline called AgentWrite with |
| Open Source Code | Yes | Our code & models are at: https://github.com/THUDM/LongWriter. |
| Open Datasets | No | The paper mentions creating the LongWriter-6k and LongBench-Write datasets but does not provide a direct link, DOI, or explicit statement of their public availability. The provided GitHub link is for 'code & models', not explicitly for 'data' or 'datasets'. |
| Dataset Splits | No | The paper describes how training sets were filtered by output length for the controlled experiments and how the LongBench-Write benchmark is divided into subsets by word count for evaluation. It also details the construction of the DPO data. However, it does not provide explicit train/validation/test splits (e.g., percentages or sample counts) for any single dataset used in the main experiments, nor a clear split for the LongWriter-6k dataset itself. |
| Hardware Specification | Yes | All models are trained using a node with 8x H800 80G GPUs and DeepSpeed+ZeRO3+CPU offloading (Rasley et al., 2020). |
| Software Dependencies | No | The paper mentions 'DeepSpeed+ZeRO3+CPU offloading (Rasley et al., 2020)' as part of the training setup. However, it does not specify a version number for DeepSpeed or for any other software dependency. |
| Experiment Setup | Yes | We use a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. We train the models for 4 epochs, which takes approximately 2,500-3,000 steps. |
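
The reported setup can be sketched as a DeepSpeed-style configuration to make the numbers concrete. This is only an illustration under stated assumptions, not the authors' configuration. It assumes that the batch size of 8 is the global batch size, that "32k" means 32,768 tokens, and that each GPU holds one packed sequence per step. The helper names `build_deepspeed_config` and `packed_sequences_per_epoch` are hypothetical. The optimizer and precision settings are not reported in the table, so they are left out.

```python
# Values quoted in the table above.
GLOBAL_BATCH_SIZE = 8        # "batch size of 8" (assumed to be the global batch size)
LEARNING_RATE = 1e-5         # "learning rate of 1e-5"
PACKING_LENGTH = 32 * 1024   # "packing length of 32k" (assumed to mean 32,768 tokens)
NUM_EPOCHS = 4               # "4 epochs"
NUM_GPUS = 8                 # one node with 8x H800 80G GPUs


def build_deepspeed_config(micro_batch_per_gpu: int = 1) -> dict:
    """Build a ZeRO stage 3 config with CPU offloading (illustrative only)."""
    grad_accum, remainder = divmod(GLOBAL_BATCH_SIZE, micro_batch_per_gpu * NUM_GPUS)
    if remainder or grad_accum == 0:
        raise ValueError("global batch size must be a multiple of micro batch x GPUs")
    return {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum,
        "zero_optimization": {
            "stage": 3,                              # ZeRO3: shard params, grads, optimizer state
            "offload_optimizer": {"device": "cpu"},  # CPU offloading of optimizer state
            "offload_param": {"device": "cpu"},      # CPU offloading of parameters
        },
    }


def packed_sequences_per_epoch(total_steps: int) -> int:
    """Packed sequences seen per epoch, implied by the reported step count."""
    return total_steps * GLOBAL_BATCH_SIZE // NUM_EPOCHS


if __name__ == "__main__":
    print(build_deepspeed_config())
    # Upper bound on tokens per optimizer step if every packed sequence is full.
    print(GLOBAL_BATCH_SIZE * PACKING_LENGTH)
    for steps in (2500, 3000):
        print(steps, packed_sequences_per_epoch(steps))
```

Under these assumptions, the reported 2,500-3,000 steps over 4 epochs correspond to roughly 5,000-6,000 packed sequences per epoch. That figure counts packed sequences, not individual training samples, since packing may place several samples in one 32k window.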