Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

Authors: Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. In this section, we first validate our method on arithmetic tasks, which rely on better position-addressing ability for prediction (Sec. 5.1). We also show our effectiveness on natural language, in both the pre-training (Sec. 5.2) and fine-tuning (Sec. 5.3) cases. More experiments, visualization, and model interpretation can be found in Appendix D and E.
Researcher Affiliation | Academia | 1 University of Texas at Austin, 2 Zhejiang University, 3 Princeton University, 4 Georgia Tech.
Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/VITA-Group/TAPE
Open Datasets | Yes | To evaluate the ability of LLMs to perform arithmetic tasks with our position embedding, we use the Addition Bucket 40 dataset (McLeish et al., 2024a)... Similarly, we first pre-train transformers with a 1024 context window from scratch, using the C4 dataset (Raffel et al., 2020), and then fine-tune those models on the long-context benchmark SCROLLS (Shaham et al., 2022)... We extend the context window of the pre-trained Llama2 7B model (GenAI, 2023) from 4096 to 8192, using RedPajama (Computer, 2023). For validation, we then compare the perplexity on sequences of length 8192, on the cleaned ArXiv Math proof-pile dataset (Azerbayev et al., 2022; Chen et al., 2023a) and the book corpus dataset PG19 (Rae et al., 2019).
Dataset Splits | Yes | We train transformers from scratch using the arithmetic data, and during evaluation, we sample 100 samples for each pair of operand lengths. For validation, we then compare the perplexity on sequences of length 8192, on the cleaned ArXiv Math proof-pile dataset (Azerbayev et al., 2022; Chen et al., 2023a) and the book corpus dataset PG19 (Rae et al., 2019). We evaluate pre-trained models' perplexity across varying sequence lengths on the GitHub test set.
Hardware Specification | Yes | on a single A100 GPU.
Software Dependencies | No | The paper mentions software components like 'FlashAttention' but does not provide specific version numbers for any key software dependencies.
Experiment Setup | Yes | The training recipes for the three experiments are presented in Tab. 5.
Table 5: Training recipe for language model pre-training and fine-tuning in experiments.
Arithmetic (5.1): sequence length 40 + 40, batch size 512, 20k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 1e-4.
C4 Pre-training (5.2): sequence length 1024, batch size 512, 10k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 1e-4.
SCROLLS (5.2): sequence length 1024, batch size 64, 1k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 1e-5.
Context Extension (5.3): sequence length 8096, batch size 64, 1k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 2e-5.
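For anyone reproducing these runs, the Table 5 recipes can be transcribed as plain Python dictionaries. This is a sketch for reference only: the key names ("seq_len", "lr", etc.) are our own shorthand, not identifiers from the authors' codebase, and the arithmetic sequence length "40 + 40" is recorded here as 80 total tokens.

```python
# Training recipes from Table 5, transcribed as plain dictionaries.
# Key names are our own shorthand, not from the TAPE repository.
RECIPES = {
    "arithmetic_5.1":        {"seq_len": 80,   "batch_size": 512, "iters": 20_000, "dropout": 0.0, "optimizer": "AdamW", "lr": 1e-4},
    "c4_pretraining_5.2":    {"seq_len": 1024, "batch_size": 512, "iters": 10_000, "dropout": 0.0, "optimizer": "AdamW", "lr": 1e-4},
    "scrolls_finetune_5.2":  {"seq_len": 1024, "batch_size": 64,  "iters": 1_000,  "dropout": 0.0, "optimizer": "AdamW", "lr": 1e-5},
    "context_extension_5.3": {"seq_len": 8096, "batch_size": 64,  "iters": 1_000,  "dropout": 0.0, "optimizer": "AdamW", "lr": 2e-5},
}

def tokens_processed(recipe: dict) -> int:
    """Upper bound on tokens seen: sequence length x batch size x iterations."""
    return recipe["seq_len"] * recipe["batch_size"] * recipe["iters"]
```

All four recipes share AdamW with zero attention dropout; only sequence length, batch size, iteration count, and learning rate vary across experiments, which makes a dictionary-driven training script a natural fit.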