3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding

Authors: Xindian Ma, Wenyuan Liu, Peng Zhang, Nan Xu

AAAI 2025

Reproducibility assessment (variable: result, followed by the supporting LLM response):
Research Type: Experimental. "We have conducted experiments on long-context Natural Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially on long-context NLU tasks."
Researcher Affiliation: Collaboration. "1 College of Intelligence and Computing, Tianjin University, Tianjin, China; 2 Beijing Wenge Technology Co. EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode: No. The paper describes its methods with mathematical formulations and descriptive text; no explicit pseudocode or algorithm blocks are present.
Open Source Code: Yes. "Our code, data, and appendix are available on GitHub (https://github.com/maxindian/3D-RPE-Long-Contex-Modeling)"
Open Datasets: Yes. "In the context of long-context NLU tasks, we employ the LongAlpaca-12k dataset, which contains 9,000 long-QA and 3,000 short-QA entries (Chen et al. 2023c), and the LongAlpaca-16k-length dataset (Chen et al. 2023b). To evaluate the performance of 3D-RPE for long-context extension, we use LongBench (Bai et al. 2023)... Additionally, the LEval (An et al. 2023) evaluation set... For long-sequence LM tasks, we use the RedPajama dataset (Computer 2023) for fine-tuning... For evaluation, we utilize the PG19 book corpus dataset (Rae et al. 2020), which includes 100 documents, and the Arxiv Math Proof-pile dataset (test split)."
Dataset Splits: Yes. "For long-sequence LM tasks, we use the RedPajama dataset (Computer 2023) for fine-tuning. The dataset is a large-scale pre-training dataset (its size reaches 1.2 trillion tokens)... We sample 20,000 samples from these data sources for training. For evaluation, we utilize the PG19 book corpus dataset (Rae et al. 2020), which includes 100 documents, and the Arxiv Math Proof-pile dataset (test split)."
Hardware Specification: Yes. "Training was conducted on a single machine with 4x A800 GPUs using FlashAttention-2 (Dao 2023)."
Software Dependencies: No. The paper mentions software such as the AdamW optimizer and FlashAttention-2, and models such as LLaMA2, but it does not specify version numbers for these or for general software dependencies such as programming languages or libraries.
Experiment Setup: Yes. "The training step is 3,000. For the long-sequence Language Modeling (LM) tasks, ... The training step is 1,000. We set the per-device batch size to 1 and the gradient-accumulation step to 8, giving an effective batch size of 8. We train the model with the next-token-prediction objective using LoRA (Hu et al. 2022). We employed the AdamW optimizer (Loshchilov and Hutter 2019) with β1 = 0.9 and β2 = 0.95 for all fine-tuned models. The chunk size is set to 3k. The learning rate was set to 2 × 10^-5, and a linear learning-rate warmup was applied."
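The reported hyperparameters can be collected into a minimal sketch. This is not the authors' code: the constant names, the helper functions, and the warmup length of 100 steps are all assumptions for illustration; the paper states only the values quoted above and that a linear warmup was used.

```python
# Hypothetical sketch of the fine-tuning setup described in the paper.
# All identifiers are illustrative; only the numeric values come from the text.

PER_DEVICE_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 8
BASE_LR = 2e-5
ADAM_BETAS = (0.9, 0.95)     # AdamW beta1, beta2
TRAIN_STEPS_NLU = 3_000      # long-context NLU tasks
TRAIN_STEPS_LM = 1_000       # long-sequence LM tasks
CHUNK_SIZE = 3_000           # "chunk size is set to 3k"

def effective_batch_size(per_device: int, accum: int, num_devices: int = 1) -> int:
    """Gradient accumulation multiplies the per-device batch size."""
    return per_device * accum * num_devices

def warmup_lr(step: int, warmup_steps: int = 100, base_lr: float = BASE_LR) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold.
    The warmup length is an assumption; the paper does not report it."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

With these values, `effective_batch_size(1, 8)` reproduces the stated batch size of 8, matching the paper's note that a per-device batch of 1 with 8 accumulation steps yields a batch size of 8.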