Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
Authors: Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, Bowen Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across several model scales and datasets show that, within varying context windows, FoPE maintains more stable performance than other baselines. The perplexity in pre-training and the accuracy in needle-in-a-haystack retrieval demonstrate FoPE's superiority over other baselines on length generalization. Evaluation on more complex tasks (i.e., summarization and few-shot question-answering), along with several analyses and ablations, brings further support to the method and theoretical modeling. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2Northeastern University 3Shanghai Artificial Intelligence Laboratory 4Shanghai Jiaotong University. Correspondence to: Biqing Qi <EMAIL>, Ning Ding <EMAIL>, Bowen Zhou <EMAIL>. |
| Pseudocode | Yes | Pseudo-code of FoPE is shown in the final pages. |
| Open Source Code | Yes | https://github.com/TsinghuaC3I/Fourier-Position-Embedding |
| Open Datasets | Yes | We train models with a 10B-token subset of C4 (Raffel et al., 2020) and evaluate the perplexity on a validation set from C4. Setting 2: We train models with 5B tokens from Gutenberg Books (Hart, 2007)... For summarization, we use GovReport (Huang et al., 2021) and Multi-News (Fabbri et al., 2019). For few-shot question-answering, we use TREC (Li & Roth, 2002), TriviaQA (Joshi et al., 2017), and SAMSum (Gliwa et al., 2019). |
| Dataset Splits | No | We train models with a 10B-token subset of C4 (Raffel et al., 2020) and evaluate the perplexity on a validation set from C4. ... Setting 2: We train models with 5B tokens from Gutenberg Books (Hart, 2007) and evaluate them on the same validation set as Setting 1. The paper mentions using C4 and Gutenberg Books and evaluating on a 'validation set from C4', but does not specify the explicit split percentages, sample counts, or methodology for the training/validation/test sets needed to reproduce them across the main datasets. |
| Hardware Specification | Yes | Our main experiments are conducted with 4 NVIDIA A6000 GPUs (maximum GPU memory = 48GB). |
| Software Dependencies | No | The provided pseudocode snippets indicate the use of 'torch' (PyTorch) functions like 'torch.randn', 'torch.einsum', 'F.pad', 'torch.cat', and 'torch.eye'. However, specific version numbers for PyTorch or any other software dependencies are not explicitly mentioned in the paper. |
| Experiment Setup | Yes | For all model scales and experimental settings, we select 6e-4 as the learning rate and warm up for 10000 steps with a cosine scheduler. While the mini-batch size on each device differs for each model, we accumulate gradients until the global batch size reaches 1024 in all experiments. We fine-tune SmolLM-1.7B with approximately 350k samples for one epoch, using the AdamW optimizer with a learning rate of 3e-4 and a cosine scheduler with a warmup ratio of 0.1. ...setting σ = 0.3 for the 60M model obtains the best perplexity, especially for longer context. The best σ implies the estimated strength of Spectrum Damage of the 60M model, and the estimate may grow as the model's parameter scale increases. ...Setting D = 64 obtains the best accuracy for Passkey Retrieval. |
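The reported optimization recipe (linear warmup for 10000 steps to a peak learning rate of 6e-4, then cosine decay) can be sketched as a per-step schedule. This is a minimal illustration, not the authors' code: `total_steps` and the decay-to-zero floor are assumptions, since the report does not state the total number of training steps.

```python
import math

def lr_at_step(step, peak_lr=6e-4, warmup_steps=10_000, total_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay to 0.

    Sketch of the schedule described in the paper's setup; total_steps
    and the final learning-rate floor of 0 are assumed for illustration.
    """
    if step < warmup_steps:
        # Linear warmup: lr rises from 0 to peak_lr over warmup_steps.
        return peak_lr * step / warmup_steps
    # Cosine decay: progress goes 0 -> 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Gradient accumulation to reach the reported global batch size of 1024,
# given a hypothetical per-device batch and device count (4 GPUs per the
# hardware row; the per-device batch size varies by model in the paper).
def accumulation_steps(global_batch=1024, per_device_batch=8, num_devices=4):
    return global_batch // (per_device_batch * num_devices)
```

In a real training loop this function would be wired in via something like PyTorch's `torch.optim.lr_scheduler.LambdaLR`, with the optimizer stepping only every `accumulation_steps` micro-batches.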