Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
Authors: Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. In this section, we first validate our method on arithmetic tasks, which rely on better position-addressing ability for prediction (Sec. 5.1). We also show our effectiveness on natural language, in both the pre-training (Sec. 5.2) and fine-tuning (Sec. 5.3) cases. More experiments, visualization, and model interpretation can be found in Appendices D and E. |
| Researcher Affiliation | Academia | 1University of Texas at Austin 2Zhejiang University 3Princeton University 4Georgia Tech. |
| Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/VITA-Group/TAPE |
| Open Datasets | Yes | To evaluate the ability of LLMs to perform arithmetic tasks with our position embedding, we use the Addition Bucket 40 dataset (McLeish et al., 2024a)... Similarly, we first pre-train transformers with a 1024 context window from scratch, using the C4 dataset (Raffel et al., 2020), and then fine-tune those models on the long-context benchmark SCROLLS (Shaham et al., 2022)... We extend the context window of the pre-trained Llama2 7B model (GenAI, 2023) from 4096 to 8192, using RedPajama (Computer, 2023). For validation, we then compare the perplexity on sequences of length 8192, on the cleaned ArXiv Math proof-pile dataset (Azerbayev et al., 2022; Chen et al., 2023a) and the book corpus dataset PG19 (Rae et al., 2019). |
| Dataset Splits | Yes | We train transformers from scratch using the arithmetic data, and during evaluation, we sample 100 samples for each pair of operand lengths. For validation, we then compare the perplexity on sequences of length 8192, on the cleaned ArXiv Math proof-pile dataset (Azerbayev et al., 2022; Chen et al., 2023a) and the book corpus dataset PG19 (Rae et al., 2019). We evaluate pre-trained models' perplexity across varying sequence lengths on the GitHub test set. |
| Hardware Specification | Yes | on a single A100 GPU. |
| Software Dependencies | No | The paper mentions software components like 'FlashAttention' but does not provide specific version numbers for any key software dependencies. |
| Experiment Setup | Yes | The training recipes for the three experiments are presented in Tab. 5. Table 5: Training recipes for language model pre-training and fine-tuning in experiments. Arithmetic (5.1): Sequence length 40 + 40, Batch size 512, Number of iterations 20k, Attention dropout prob. 0.0, Optimizer AdamW, Learning rate 1×10⁻⁴. C4 Pre-training (5.2): Sequence length 1024, Batch size 512, Number of iterations 10k, Attention dropout prob. 0.0, Optimizer AdamW, Learning rate 1×10⁻⁴. SCROLLS (5.2): Sequence length 1024, Batch size 64, Number of iterations 1k, Attention dropout prob. 0.0, Optimizer AdamW, Learning rate 1×10⁻⁵. Context Extension (5.3): Sequence length 8096, Batch size 64, Number of iterations 1k, Attention dropout prob. 0.0, Optimizer AdamW, Learning rate 2×10⁻⁵. |
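For reproduction, the flattened Table 5 recipes can be restated as a small configuration dictionary. This is a sketch only: the dictionary layout and key names are our own, while the values (sequence length, batch size, iteration count, dropout, optimizer, learning rate) are taken verbatim from the quoted table, including the "8096" sequence length printed there.

```python
# Training recipes transcribed from Table 5 of the TAPE paper.
# Key names are our own convention; values come from the paper.
RECIPES = {
    "arithmetic": {            # Sec. 5.1
        "seq_len": "40 + 40",  # operand lengths, as printed
        "batch_size": 512,
        "iterations": 20_000,
        "attn_dropout": 0.0,
        "optimizer": "AdamW",
        "lr": 1e-4,
    },
    "c4_pretrain": {           # Sec. 5.2
        "seq_len": 1024,
        "batch_size": 512,
        "iterations": 10_000,
        "attn_dropout": 0.0,
        "optimizer": "AdamW",
        "lr": 1e-4,
    },
    "scrolls_finetune": {      # Sec. 5.2
        "seq_len": 1024,
        "batch_size": 64,
        "iterations": 1_000,
        "attn_dropout": 0.0,
        "optimizer": "AdamW",
        "lr": 1e-5,
    },
    "context_extension": {     # Sec. 5.3
        "seq_len": 8096,       # as printed in the paper's Table 5
        "batch_size": 64,
        "iterations": 1_000,
        "attn_dropout": 0.0,
        "optimizer": "AdamW",
        "lr": 2e-5,
    },
}

if __name__ == "__main__":
    for name, cfg in RECIPES.items():
        print(f"{name}: {cfg['iterations']} steps, lr={cfg['lr']}")
```

Having the recipes in one structure makes it easy to feed them into a trainer or to diff them against a reimplementation's hyperparameters.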