Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

Authors: Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. In this section, we first validate our method on arithmetic tasks, which rely on better position-addressing ability for prediction (Sec. 5.1). We also show our effectiveness on natural language, in both the pre-training (Sec. 5.2) and fine-tuning (Sec. 5.3) cases. More experiments, visualization, and model interpretation can be found in Appendix D and E.
Researcher Affiliation | Academia | 1 University of Texas at Austin, 2 Zhejiang University, 3 Princeton University, 4 Georgia Tech.
Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/VITA-Group/TAPE
Open Datasets | Yes | To evaluate the ability of LLMs to perform arithmetic tasks with our position embedding, we use the Addition Bucket 40 dataset (McLeish et al., 2024a)... Similarly, we first pre-train transformers with a 1024 context window from scratch, using the C4 dataset (Raffel et al., 2020), and then fine-tune those models on the long-context benchmark SCROLLS (Shaham et al., 2022)... We extend the context window of the pre-trained Llama2 7B model (GenAI, 2023) from 4096 to 8192, using RedPajama (Computer, 2023). For validation, we then compare the perplexity on sequences of length 8192, on the cleaned ArXiv Math proof-pile dataset (Azerbayev et al., 2022; Chen et al., 2023a) and the book corpus dataset PG19 (Rae et al., 2019).
Dataset Splits | Yes | We train transformers from scratch using the arithmetic data, and during evaluation, we sample 100 samples for each pair of operand lengths. For validation, we then compare the perplexity on sequences of length 8192, on the cleaned ArXiv Math proof-pile dataset (Azerbayev et al., 2022; Chen et al., 2023a) and the book corpus dataset PG19 (Rae et al., 2019). We evaluate pre-trained models' perplexity across varying sequence lengths on the GitHub test set.
Hardware Specification | Yes | on a single A100 GPU.
Software Dependencies | No | The paper mentions software components like 'FlashAttention' but does not provide specific version numbers for any key software dependencies.
Experiment Setup | Yes | The training recipes for the three experiments are presented in Tab. 5.
Table 5: Training recipe for language model pre-training and fine-tuning in experiments.
Arithmetic (5.1): sequence length 40 + 40, batch size 512, 20k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 1e-4.
C4 Pre-training (5.2): sequence length 1024, batch size 512, 10k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 1e-4.
SCROLLS (5.2): sequence length 1024, batch size 64, 1k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 1e-5.
Context Extension (5.3): sequence length 8096, batch size 64, 1k iterations, attention dropout prob. 0.0, AdamW optimizer, learning rate 2e-5.
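For anyone reproducing these runs, the Table 5 recipes can be transcribed as plain Python dictionaries. This is a sketch for reference only: the key names ("seq_len", "lr", etc.) are our own shorthand, not identifiers from the authors' codebase, and the arithmetic sequence length "40 + 40" is recorded here as 80 total tokens.

```python
# Training recipes from Table 5, transcribed as plain dictionaries.
# Key names are our own shorthand, not from the TAPE repository.
RECIPES = {
    "arithmetic_5.1":        {"seq_len": 80,   "batch_size": 512, "iters": 20_000, "dropout": 0.0, "optimizer": "AdamW", "lr": 1e-4},
    "c4_pretraining_5.2":    {"seq_len": 1024, "batch_size": 512, "iters": 10_000, "dropout": 0.0, "optimizer": "AdamW", "lr": 1e-4},
    "scrolls_finetune_5.2":  {"seq_len": 1024, "batch_size": 64,  "iters": 1_000,  "dropout": 0.0, "optimizer": "AdamW", "lr": 1e-5},
    "context_extension_5.3": {"seq_len": 8096, "batch_size": 64,  "iters": 1_000,  "dropout": 0.0, "optimizer": "AdamW", "lr": 2e-5},
}

def tokens_processed(recipe: dict) -> int:
    """Upper bound on tokens seen: sequence length x batch size x iterations."""
    return recipe["seq_len"] * recipe["batch_size"] * recipe["iters"]
```

All four recipes share AdamW with zero attention dropout; only sequence length, batch size, iteration count, and learning rate vary across experiments, which makes a dictionary-driven training script a natural fit.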