Why Does the Effective Context Length of LLMs Fall Short?

Authors: Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on the popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs.
Researcher Affiliation | Collaboration | Chenxin An¹, Jun Zhang², Ming Zhong³, Lei Li¹, Shansan Gong¹, Yao Luo², Jingjing Xu², Lingpeng Kong¹ — ¹The University of Hong Kong, ²ByteDance Inc., ³University of Illinois Urbana-Champaign
Pseudocode | Yes | The pseudocode for STRING is provided in Algorithm 1. ... Algorithm 1: Pseudocode of STRING with Flash Attention ... Algorithm 2: Pseudocode of merge_diag_shifted
Open Source Code | Yes | All code and data used in this work are released at https://github.com/HKUNLP/STRING.
Open Datasets | Yes | We pretrain two 1.3B-parameter models (referred to as TinyLlama-1.3B) from scratch on the natural data distribution of the SlimPajama dataset... SlimPajama-627B (Cerebras, 2023) ... To measure the effective context length, we adopt the popular Needle-in-a-Haystack task (gkamradt, 2023).
Dataset Splits | Yes | We use the 4-needle setting, the same as described in the Llama 3.1 report (Llama Team, 2024), which involves inserting four needles (6-digit numbers (Hsieh et al., 2024; Mohtashami & Jaggi, 2023)) into the context at various positions. The model should perfectly retrieve at least two of them. ... We perform 500 tests at each length.
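The 4-needle setting above can be sketched as follows. This is a hypothetical illustration of the described protocol, not the paper's harness: the prompt wording, needle phrasing, and helper names (`build_niah_example`, `passes`) are assumptions; only the 6-digit needles, the varied insertion positions, and the "retrieve at least two of four" pass criterion come from the quoted description.

```python
import random

def build_niah_example(haystack: str, num_needles: int = 4, seed: int = 0):
    """Insert `num_needles` random 6-digit numbers at varied positions.

    Hypothetical sketch of the 4-needle Needle-in-a-Haystack setting;
    the needle sentence template is an assumption.
    """
    rng = random.Random(seed)
    needles = [str(rng.randrange(100_000, 1_000_000)) for _ in range(num_needles)]
    words = haystack.split()
    # Insert at the largest position first so earlier indices stay valid.
    positions = sorted(rng.sample(range(len(words) + 1), num_needles), reverse=True)
    for pos, needle in zip(positions, needles):
        words.insert(pos, f"The magic number is {needle}.")
    return " ".join(words), needles

def passes(model_output: str, needles, min_hits: int = 2) -> bool:
    """Pass criterion from the quoted setup: at least two of the four
    needles must appear in the model's answer."""
    return sum(needle in model_output for needle in needles) >= min_hits
```

Scoring by substring containment is a simplification; a real evaluation would also normalize the model's answer format.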
Hardware Specification | Yes | We utilized 16 NVIDIA A100 80GB GPUs on 2 nodes.
Software Dependencies | Yes | The main speed optimization libraries employed in this project are Fully Sharded Data Parallel (FSDP), Flash Attention-2 (Dao, 2023), and xFormers (Lefaudeux et al., 2022). ... We use the cross entropy loss as the pretraining objective and the AdamW optimizer (Loshchilov & Hutter, 2019).
Experiment Setup | Yes | We use a hidden size of 2,048, and the feed-forward layers inside each transformer block have a size of 5,632. The model employs 32 attention heads and comprises 22 layers. ... We used the SlimPajama-627B (Cerebras, 2023) dataset as our pretraining corpus, and the total training budget for each model is 1T tokens. ... We use the cross entropy loss as the pretraining objective and the AdamW optimizer (Loshchilov & Hutter, 2019). Additionally, we employed a cosine learning rate schedule with a maximum learning rate of 4e-4, starting from a minimum learning rate of 4e-5. The warmup steps are 2,000. The batch size is set to 4M tokens for all training context lengths. For the model pretrained with a 4K context length, the gradient accumulation is set to twice that of the model trained with a 2K context length. ... A gradient clipping threshold of 1.0 is used to stabilize the gradient.
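The quoted schedule (cosine, max 4e-4, starting from a minimum of 4e-5, 2,000 warmup steps) can be sketched as a step-to-learning-rate function. The exact shape is an assumption: the sketch warms up linearly from the minimum rate and cosine-decays back to it, which matches the quoted numbers but is not confirmed by the paper.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int = 2000,
               max_lr: float = 4e-4, min_lr: float = 4e-5) -> float:
    """Cosine learning-rate schedule with linear warmup.

    Hyperparameters follow the quoted setup (max 4e-4, min 4e-5,
    2,000 warmup steps); warming up from min_lr and decaying back to
    min_lr are assumptions about the schedule's endpoints.
    """
    if step < warmup_steps:
        # Linear warmup from the minimum to the maximum learning rate.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine decay from max_lr back down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With a 4M-token batch, the quoted 1T-token budget corresponds to roughly 250,000 optimizer steps, so `total_steps=250_000` would be a plausible value here.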