Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization

Authors: Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, Bowen Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across various model scales and benchmarks show that, within varying context windows, FoPE maintains more stable performance than other baselines. Several analyses and ablations bring further support to our method and theoretical modeling. We conduct experiments across several model scales and datasets. The perplexity in pre-training and the accuracy in needle-in-a-haystack retrieval demonstrate FoPE's superiority over other baselines on length generalization. Evaluation on more complex tasks (i.e., summarization and few-shot question-answering) brings further support to our method and theoretical modeling.
Researcher Affiliation | Collaboration | 1 Tsinghua University, 2 Northeastern University, 3 Shanghai Artificial Intelligence Laboratory, 4 Shanghai Jiao Tong University. Correspondence to: Biqing Qi <EMAIL>, Ning Ding <EMAIL>, Bowen Zhou <EMAIL>.
Pseudocode | Yes | Pseudocode of FoPE is shown in the final pages.
Open Source Code | Yes | https://github.com/TsinghuaC3I/Fourier-Position-Embedding
Open Datasets | Yes | We train models with a 10B-token subset of C4 (Raffel et al., 2020) and evaluate perplexity on a validation set from C4. Setting 2: We train models with 5B tokens from Gutenberg Books (Hart, 2007)... For summarization, we use GovReport (Huang et al., 2021) and MultiNews (Fabbri et al., 2019). For few-shot question-answering, we use TREC (Li & Roth, 2002), TriviaQA (Joshi et al., 2017), and SAMSum (Gliwa et al., 2019).
Dataset Splits | No | We train models with a 10B-token subset of C4 (Raffel et al., 2020) and evaluate perplexity on a validation set from C4. ... Setting 2: We train models with 5B tokens from Gutenberg Books (Hart, 2007) and evaluate them on the same validation set as Setting 1. The paper mentions using C4 and Gutenberg Books and evaluating on a 'validation set from C4', but does not specify explicit split percentages, sample counts, or the methodology for constructing training/validation/test sets, which limits reproducibility across the main datasets.
Hardware Specification | Yes | Our main experiments are conducted with four NVIDIA A6000 GPUs (maximum GPU memory = 48GB).
Software Dependencies | No | The provided pseudocode snippets indicate the use of PyTorch ('torch') functions such as torch.randn, torch.einsum, F.pad, torch.cat, and torch.eye. However, specific version numbers for PyTorch or any other software dependencies are not mentioned in the paper.
Experiment Setup | Yes | For all model scales and experimental settings, we select 6e-4 as the learning rate and warm up for 10,000 steps with a cosine scheduler. While the mini-batch size on each device differs per model, we accumulate gradients until the global batch size reaches 1024 in all experiments. We fine-tune SmolLM-1.7B with approximately 350k samples for one epoch, using the AdamW optimizer with a learning rate of 3e-4 and a cosine scheduler with a warmup ratio of 0.1. ... Setting σ = 0.3 for the 60M model obtains the best perplexity, especially for longer contexts. The best σ implies the estimated strength of Spectrum Damage in the 60M model, and the estimate may grow as the model's parameter scale increases. ... Setting D = 64 obtains the best accuracy for Passkey Retrieval.
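The training recipe in the Experiment Setup row (peak learning rate 6e-4, 10,000 warmup steps, cosine scheduler) can be sketched as a small schedule function. This is a minimal illustration, not the paper's training code; the total step count and floor learning rate below are placeholder assumptions.

```python
import math

def lr_at_step(step, peak_lr=6e-4, warmup_steps=10_000,
               total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr.

    peak_lr and warmup_steps match the reported recipe; total_steps and
    min_lr are illustrative assumptions not stated in the paper.
    """
    if step < warmup_steps:
        # Linear warmup from ~0 up to peak_lr at the end of warmup.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, clamped at the schedule's end.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

In practice this shape is what PyTorch's `LambdaLR` or similar warmup-plus-cosine helpers compute per optimizer step.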
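The Software Dependencies row notes that the paper's pseudocode builds on PyTorch primitives such as torch.randn and torch.einsum. As a dependency-light illustration of the underlying idea, here is a NumPy sketch of a FoPE-style phase computation; this is an assumed reconstruction for exposition, not the authors' implementation. Each RoPE frequency dimension is augmented with a random Fourier-series mixture of the other frequencies (coefficient scale σ, cf. the σ = 0.3 ablation above), and frequencies below a floor are clipped to zero so those dimensions become position-independent components.

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per channel pair."""
    return base ** (-np.arange(0, dim, 2) / dim)

def fope_phase(positions, dim, sigma=0.3, floor_freq=None, seed=0):
    """FoPE-style phase features (illustrative sketch, not the paper's code).

    Each dimension's single frequency is replaced by a Fourier-series
    mixture: its own cosine plus sigma-scaled random combinations of the
    other frequencies. Frequencies below `floor_freq` are clipped to zero.
    """
    rng = np.random.default_rng(seed)
    freqs = rope_freqs(dim)                                   # shape (dim/2,)
    if floor_freq is not None:
        # Zero out undertrained low frequencies -> constant (DC) components.
        freqs = np.where(freqs < floor_freq, 0.0, freqs)
    phases = np.outer(np.asarray(positions), freqs)           # (seq, dim/2)
    # Identity keeps each dim's own frequency; the random part stands in for
    # learned Fourier coefficients (sigma = 0 recovers plain RoPE cosines).
    coeffs = np.eye(len(freqs)) + sigma * rng.standard_normal((len(freqs), len(freqs)))
    return np.cos(phases) @ coeffs.T                          # (seq, dim/2)
```

With `sigma=0` this reduces to the cosine half of standard RoPE, which makes the frequency-clipping effect easy to inspect in isolation.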