Scaling Instruction-tuned LLMs to Million-token Contexts via Hierarchical Synthetic Data Generation

Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive evaluations on shorter context lengths (100K and 180K) to demonstrate the effectiveness of our hierarchical strategy, multi-document combinations, and diverse question-answer pair generation. These evaluations validate that our core strategies work well across various tasks and context lengths. 3. Scaling to 1M context length: We successfully extend LLaMA-3.1-8B-Instruct to a context length of 1 million tokens. Our model significantly outperforms the LLaMA-3.1-8B-Instruct model in zero-shot RoPE scaling to a 1M context window on the RULER benchmark and surpasses the gradientai/Llama-3-8B-Instruct-Gradient-1048k model trained by Gradient AI. Additionally, our model outcompetes LLaMA-3.1-8B-Instruct on InfiniteBench while maintaining strong performance on LongBench and MMLU.
Researcher Affiliation Collaboration 1 Harvard University, 2 Together AI, 3 University of Chicago. EMAIL, EMAIL, EMAIL
Pseudocode Yes A APPENDIX: ADDITIONAL DETAILS ON DATA GENERATION ALGORITHMS In this section, we present the pseudocode for the hierarchical QA generation strategy described in Section 3.1, along with the algorithm for combining multiple documents, as outlined in Section 3.2. Algorithm 1 Hierarchical Question Generation Strategy (Single Document) ... Algorithm 2 Concatenating Multiple Documents
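The hierarchical strategy of Algorithm 1 can be illustrated with a minimal Python sketch. This is a hypothetical reconstruction, not the paper's `generatingdata.py`: `chunk`, `summarize`, and `build_hierarchy` are illustrative names, and `summarize` is a stub standing in for an LLM call.

```python
# Hypothetical sketch of a hierarchical summarization tree (cf. Algorithm 1):
# split the document into chunks, then summarize groups of nodes bottom-up
# until a single global summary remains. QA pairs would then be generated
# against nodes at each level of the tree.

def chunk(text, size):
    """Split a document into fixed-size character chunks (token-based in practice)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(texts):
    # Stub: a real pipeline would call an LLM here to summarize the group.
    return " ".join(t[:20] for t in texts)

def build_hierarchy(document, chunk_size=100, fanout=2):
    """Return a list of levels: levels[0] = raw chunks, levels[-1] = global summary."""
    levels = [chunk(document, chunk_size)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([summarize(prev[i:i + fanout])
                       for i in range(0, len(prev), fanout)])
    return levels
```

Questions anchored at lower levels probe local details, while questions against the root summary require global reasoning over the whole context.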
Open Source Code Yes Reproducibility. We included the code to generate hierarchical and diverse questions for a single document (see Section 3.1) in the supplementary material (see generatingdata.py), as well as the code to concatenate multiple documents (see Section 3.2) (see concatenate-350K.py).
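The document-concatenation step (Section 3.2) amounts to packing shorter documents into a long training sample under a token budget. Below is a minimal greedy sketch under that assumption; it is not the paper's `concatenate-350K.py`, and `concatenate` and its length-based token count are illustrative.

```python
# Hypothetical sketch of multi-document concatenation (cf. Algorithm 2):
# greedily pack documents into samples of at most `budget` tokens.

def concatenate(docs, budget):
    """Group documents into samples whose combined size stays within budget."""
    samples, current, used = [], [], 0
    for doc in docs:
        n = len(doc)  # stand-in for a real token count
        if current and used + n > budget:
            samples.append(current)  # flush the full sample
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        samples.append(current)
    return samples
```

With a 350K-token budget, packing ~4 long-book documents per sample matches the 4/8/12-document stages described in the experiment setup.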
Open Datasets Yes Our primary dataset is the Together long books dataset [1], processed into approximately 1.4 billion tokens, distributed across these stages: 2000 samples of 180K tokens, 1280 samples of 350K tokens, 600 samples of 650K tokens, and 200 samples of 1M tokens. We generated 582,900 QA pairs with hierarchical and diverse questions for robust instruction-tuning using the Together AI inference API [2]. By sending 32 simultaneous API requests, it took about two days to create our full long-context instruction dataset, comprising 7,772 books. For each book, we generated 25 hierarchical and 50 diverse questions, resulting in 582,900 QA pairs alongside global summaries. During training, we calculate loss solely on answers, masking out questions and context to ensure the model focuses on reasoning and generating accurate answers without being penalized for reproducing input content. [1] https://huggingface.co/datasets/togethercomputer/Long-Data-Collections
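The answer-only loss described above is typically implemented by copying the input IDs into the label tensor and masking the context and question positions with `-100`, the ignore index that PyTorch's cross-entropy loss skips. A minimal sketch, assuming the answer occupies the tail of the sequence (`mask_labels` and `answer_start` are illustrative names, not from the paper):

```python
IGNORE_INDEX = -100  # positions with this label are ignored by cross-entropy loss

def mask_labels(input_ids, answer_start):
    """Copy input_ids into labels, masking every token before the answer span.

    Loss is then computed only on answer tokens, so the model is not
    penalized for reproducing the context or the question.
    """
    labels = list(input_ids)
    for i in range(min(answer_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

In a real collator this would operate on tensors and mask every non-answer span, including padding.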
Dataset Splits Yes Our primary dataset is the Together long books dataset1, processed into approximately 1.4 billion tokens, distributed across these stages: 2000 samples of 180K tokens, 1280 samples of 350K tokens, 600 samples of 650K tokens, and 200 samples of 1M tokens.
Hardware Specification Yes Hardware. We fine-tuned our models on a SLURM cluster using 8 to 32 H100 GPUs across up to 4 nodes, connected via InfiniBand for efficient multi-node training. We used FSDP to shard the model across GPUs and implemented DeepSpeed Ulysses sequence parallelism for long-context training.
Software Dependencies No The paper mentions software components like 'FSDP' and 'DeepSpeed Ulysses' but does not specify any version numbers for these or any other software dependencies.
Experiment Setup Yes To extend Llama-3.1-8B-Instruct to a 1M context model, we applied stepwise RoPE scaling. Training started with 180K tokens and progressed through checkpoints at 350K, 650K, and 1M tokens, concatenating 4, 8, and 12 documents as per the algorithm in Section 3.2. We compiled 2000 samples at 180K, 1280 at 350K, 600 at 650K, and 200 at 1M tokens. Data was generated using Qwen-2-72B, and Llama-3.1-8B-Instruct was fine-tuned with RoPE scaling at a 6e-5 learning rate for 1 epoch. Training the 650K model took 30 hours, and the 1M model took an additional 52.5 hours.
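The arithmetic behind stepwise RoPE scaling can be sketched as follows. This is an illustration using simple linear position scaling, an assumption: the paper does not specify which RoPE-scaling variant it uses, and Llama-3.1's native 131,072-token window is assumed as the baseline.

```python
# Hypothetical per-stage RoPE scaling factors: under linear scaling,
# positions are compressed by target_context / original_context so the
# extended window maps back into the range the model was trained on.

ORIGINAL_CONTEXT = 131_072  # assumed Llama-3.1 native context window

def linear_rope_factor(target_context, original=ORIGINAL_CONTEXT):
    """Linear RoPE scaling factor for a desired context length (>= 1.0)."""
    return max(1.0, target_context / original)

# The four curriculum stages from the experiment setup.
stages = [180_000, 350_000, 650_000, 1_000_000]
factors = {n: round(linear_rope_factor(n), 2) for n in stages}
```

Each training stage would raise the factor before fine-tuning on the correspondingly longer packed samples.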