Structured Packing in LLM Training Improves Long Context Utilization
Authors: Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Yu Zhao, Henryk Michalewski, Łukasz Kuciński, Piotr Miłoś
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that SPLICE improves performance on long-context tasks, particularly by achieving perfect accuracy on the synthetic Needle in a Haystack benchmark and effectively mitigating the lost-in-the-middle phenomenon often observed in large language models. Notably, these long-context capabilities also extend to realistic downstream tasks, such as Qasper, across multiple model sizes (3B, 7B, and 13B) and are achieved with only brief fine-tuning on 2-6 billion tokens. We supplement these results with a detailed analysis of SPLICE, examining the impact of hyperparameter choices, the different mixtures and proportions of SPLICE-generated training data, and the choice of the retriever. We empirically validate SPLICE, showing that fine-tuning OpenLLaMA 3Bv2 and 7Bv2 (Geng and Liu 2023) and CodeLlama 13B (Rozière et al. 2023) with a mere 2B-6B tokens already brings improvements in handling long-context information in downstream tasks that require retrieval and in-context learning. |
| Researcher Affiliation | Collaboration | 1University of Warsaw, Krakowskie Przedmieście 26/28, 00-927 Warsaw, Poland 2IDEAS NCBR, Chmielna 69, 00-801 Warsaw, Poland 3University of Edinburgh, Old College, South Bridge, Edinburgh EH8 9YL, United Kingdom 4Google DeepMind, 5 New Street Square, London, United Kingdom 5Institute of Mathematics, Polish Academy of Sciences, Jana i Jędrzeja Śniadeckich 8, 00-656 Warsaw, Poland 6xAI, San Francisco Bay Area, California, U.S. |
| Pseudocode | Yes | Algorithm 1: SPLICE training sample construction |
| Open Source Code | Yes | Code https://github.com/ideas-ncbr/publications |
| Open Datasets | Yes | We fine-tune OpenLLaMA 3Bv2, OpenLLaMA 7Bv2 (Geng and Liu 2023) and CodeLlama 13B (Rozière et al. 2023) using SPLICE, showing that it improves long-context downstream performance. These tasks include Qasper (Dasigi et al. 2021) from SCROLLS (Shaham et al. 2022), HotpotQA (Yang et al. 2018), Needle in a Haystack (Kamradt 2023), TREC (Li and Roth 2002; Hovy et al. 2001), and DBpedia (Lehmann et al. 2015). For the 3B model experiments, we fine-tune on a 50/50 mixture of RedPajama, prepared in the standard way, and C prepared using SPLICE BM25. For the 7B and 13B models, we fine-tune on a 50/25/25 mixture of RedPajama (50) prepared in the standard way, StackExchange (25), and C (25) prepared using SPLICE BM25. StackExchange is part of RedPajama (Together Computer 2023), and the C data come from StarCoder (Li et al. 2023b). |
| Dataset Splits | No | The paper describes data mixtures used for fine-tuning, such as a "50/50 mixture of RedPajama... and C" or a "50/25/25 mixture of RedPajama... StackExchange and C". It also mentions using "held-out portions of the arXiv... and StarCoder" for perplexity evaluation and sampling a "subset of 500 elements of the evaluation set" for DBpedia. However, it does not provide specific train/test/validation split percentages, absolute sample counts for each split, or clear references to predefined standard splits for the experimental tasks in a way that would allow direct reproduction of the data partitioning. |
| Hardware Specification | No | The paper mentions "TPU Research Cloud program" as instrumental for providing computational resources, but it does not specify the exact GPU/CPU models, TPU versions, or specific configurations (e.g., number of cores, memory) used for the experiments. This level of detail is insufficient for hardware reproduction. |
| Software Dependencies | No | The paper mentions methods like BM25, Contriever, and tools such as "retriv" and "Faiss". However, it does not provide specific version numbers for these tools or for general software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions, which are necessary for reproducibility. |
| Experiment Setup | Yes | We use a batch size of 256K (512K, resp.) tokens per step for the 3B and 7B (13B, resp.) models. We set a learning rate of 1.5e-5 with linear warmup and cosine decay, following (Geng and Liu 2023). If not stated otherwise, in SPLICE we use k = 1 and the identity permutation as Order in Algorithm 1. Hyperparameter details can be found in Appendix A. |
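The pseudocode row above refers to Algorithm 1 (SPLICE training sample construction): starting from a root document, related documents are retrieved (e.g. with BM25) and appended until a context budget is filled. The sketch below is a minimal, self-contained illustration of that retrieve-and-concatenate idea, not the paper's exact implementation; the `splice_example` function, its greedy breadth-first traversal, and the toy BM25 scorer are assumptions for illustration (the paper uses tools such as retriv/Faiss and an `Order` permutation that we fix to the identity here, matching the default k = 1 setting).

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Toy BM25: score every document in corpus_tokens against the query."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    df = Counter()  # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores


def splice_example(corpus, root_idx, k=1, budget=4):
    """Hypothetical SPLICE-style construction: start from a root document,
    then repeatedly append the top-k BM25 neighbours of documents already
    chosen (identity Order), until `budget` documents are concatenated."""
    tokens = [d.lower().split() for d in corpus]
    used = {root_idx}
    order = [root_idx]
    frontier = [root_idx]
    while frontier and len(order) < budget:
        current = frontier.pop(0)
        scores = bm25_scores(tokens[current], tokens)
        ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
        added = 0
        for i in ranked:
            if i in used:
                continue
            used.add(i)
            order.append(i)
            frontier.append(i)
            added += 1
            if added == k or len(order) == budget:
                break
    return " ".join(corpus[i] for i in order)
```

With k = 1 each step pulls in only the single most related unused document, so the resulting training example reads as a chain of topically linked texts rather than the random concatenation used in standard packing.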
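The experiment-setup row specifies a peak learning rate of 1.5e-5 with linear warmup and cosine decay. A minimal sketch of such a schedule is below; the function name, the warmup length, and the assumption that the cosine decays all the way to zero are illustrative choices not stated in the paper.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1.5e-5, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay (assumed to zero)."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

At a 256K-token batch size for the 3B/7B models, a 2B-6B-token fine-tuning run corresponds to only a few thousand optimizer steps, so the schedule completes quickly.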