Why Does the Effective Context Length of LLMs Fall Short?
Authors: Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on the popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. |
| Researcher Affiliation | Collaboration | Chenxin An (1), Jun Zhang (2), Ming Zhong (3), Lei Li (1), Shansan Gong (1), Yao Luo (2), Jingjing Xu (2), Lingpeng Kong (1) — (1) The University of Hong Kong; (2) ByteDance Inc.; (3) University of Illinois Urbana-Champaign |
| Pseudocode | Yes | The pseudocode for STRING is provided in Algorithm 1. ... Algorithm 1 Pseudocode of STRING with Flash Attention ... Algorithm 2 Pseudocode of merge_diag_shifted |
| Open Source Code | Yes | All code and data used in this work are released at https://github.com/HKUNLP/STRING. |
| Open Datasets | Yes | We pretrain two 1.3B-parameter models (referred to as TinyLlama-1.3B) from scratch on the natural data distribution of the SlimPajama dataset... SlimPajama-627B (Cerebras, 2023) ... To measure the effective context length, we adopt the popular Needle-in-a-Haystack task (gkamradt, 2023). |
| Dataset Splits | Yes | We use the 4-needle setting, the same as described in the Llama 3.1 report (Llama Team, 2024), which involves inserting four needles (6-digit numbers (Hsieh et al., 2024; Mohtashami & Jaggi, 2023)) into the context at various positions. The model should perfectly retrieve at least two of them. ... We perform 500 tests at each length. |
| Hardware Specification | Yes | We utilized 16 NVIDIA A100 80GB GPUs across 2 nodes. |
| Software Dependencies | Yes | The main speed optimization libraries employed in this project are Fully Sharded Data Parallel (FSDP), Flash Attention-2 (Dao, 2023), and xFormers (Lefaudeux et al., 2022). ... We use the cross entropy loss as the pretraining objective and the AdamW optimizer (Loshchilov & Hutter, 2019). |
| Experiment Setup | Yes | We use a hidden size of 2,048; the feed-forward layers inside each transformer block have a size of 5,632. The model employs 32 attention heads and comprises 22 layers. ... We used the SlimPajama-627B (Cerebras, 2023) dataset as our pretraining corpus, and the total training budget for each model is 1T tokens. ... We use the cross entropy loss as the pretraining objective and the AdamW optimizer (Loshchilov & Hutter, 2019). Additionally, we employed a cosine learning rate schedule with a maximum learning rate of 4 × 10⁻⁴, decaying to a minimum learning rate of 4 × 10⁻⁵. The warmup steps are 2,000. The batch size is set to 4M tokens for different training context lengths. For the model pretrained with a 4K context length, the gradient accumulation is set to twice that of the model trained with a 2K context length. ... A gradient clipping threshold of 1.0 is used to stabilize the gradient. |
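The learning rate schedule quoted above (cosine decay from 4 × 10⁻⁴ to 4 × 10⁻⁵ with 2,000 warmup steps) can be sketched as follows. This is a minimal illustration, not the authors' code: the linear-from-zero warmup shape and the total step count are assumptions, since the quote only specifies the endpoints and the warmup length.

```python
import math

def lr_schedule(step, total_steps, warmup_steps=2000,
                max_lr=4e-4, min_lr=4e-5):
    """Cosine decay with linear warmup, matching the quoted hyperparameters.

    Warmup shape (linear from 0) is an assumption; the paper only states
    max_lr, min_lr, and the number of warmup steps.
    """
    if step < warmup_steps:
        # Linear warmup up to max_lr.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the rate peaks at 4 × 10⁻⁴ and decays smoothly to 4 × 10⁻⁵ at the final step.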
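The evaluation protocol in the Dataset Splits row (4-needle setting, pass if at least two of four needles are retrieved, 500 tests per length) can be sketched as a pass criterion plus an effective-length readout. This is a hypothetical illustration: the 0.95 pass-rate threshold for declaring a length "effective" is an assumption, not stated in the quoted text.

```python
def needle_pass(retrieved, needles, min_hits=2):
    """4-needle pass criterion: at least 2 of the 4 inserted
    6-digit needles must be perfectly retrieved."""
    return sum(needle in retrieved for needle in needles) >= min_hits

def effective_context_length(pass_rate_by_length, threshold=0.95):
    """Longest tested length whose pass rate (over e.g. 500 tests)
    meets the threshold. The threshold value is an assumption."""
    passing = [length for length, rate in pass_rate_by_length.items()
               if rate >= threshold]
    return max(passing) if passing else 0
```

For example, a model retrieving only one of four needles fails that test, and a model passing 97% of tests at 8K but only 60% at 16K would have an effective context length of 8K under the assumed threshold.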