3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
Authors: Xindian Ma, Wenyuan Liu, Peng Zhang, Nan Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted experiments on long-context Natural Language Understanding (NLU) and long sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially in long-context NLU tasks. |
| Researcher Affiliation | Collaboration | 1College of Intelligence and Computing, Tianjin University, Tianjin, China 2Beijing Wenge Technology Co. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | Our code, data, and appendix are available on GitHub (https://github.com/maxindian/3D-RPE-Long-Contex-Modeling) |
| Open Datasets | Yes | In the context of long-context NLU tasks, we employ the LongAlpaca-12k dataset, which contains 9,000 long QA and 3,000 short QA entries (Chen et al. 2023c), and the LongAlpaca-16k-length dataset (Chen et al. 2023b). To evaluate the performance of 3D-RPE for long-context extension, we use LongBench (Bai et al. 2023)... Additionally, the L-Eval (An et al. 2023) evaluation set... For long-sequence LM tasks, we use the RedPajama dataset (Computer 2023) for fine-tuning training... For evaluation, we utilize the PG19 book corpus dataset (Rae et al. 2020), which includes 100 documents, and the ArXiv Math Proof-pile dataset (test split). |
| Dataset Splits | Yes | For long-sequence LM tasks, we use the RedPajama dataset (Computer 2023) for fine-tuning training. The dataset is a large-scale pre-training dataset (the size reaches 1.2 trillion tokens)... We sample 20,000 samples from these data sources for training. For evaluation, we utilize the PG19 book corpus dataset (Rae et al. 2020), which includes 100 documents, and the ArXiv Math Proof-pile dataset (test split). |
| Hardware Specification | Yes | Training was conducted on a single machine with 4× A800 GPUs, using FlashAttention-2 (Dao 2023). |
| Software Dependencies | No | The paper mentions software like the 'AdamW optimizer' and 'FlashAttention-2' and models such as 'LLaMA2', but it does not specify version numbers for these or other general software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | The training step is 3,000. For the long-sequence Language Modeling (LM) tasks, ... The training step is 1,000. We set the per-device batch size as 1, and gradient accumulation step as 8, which means that the batch size is 8. We train the model with the next token prediction objective with LoRA (Hu et al. 2022). We employed the AdamW optimizer (Loshchilov and Hutter 2019) with β1 = 0.9 and β2 = 0.95 for all fine-tuned models. Chunk size is set to 3k. The learning rate was set to 2 × 10⁻⁵, and a linear learning rate warmup was applied. |
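The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not taken from the authors' repository; the class and field names are illustrative, and the interpretation of the 3k chunk size as a token count is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    """Hyperparameters quoted in the paper's experiment setup (names are illustrative)."""
    per_device_batch_size: int = 1      # "per-device batch size as 1"
    gradient_accumulation_steps: int = 8  # "gradient accumulation step as 8"
    learning_rate: float = 2e-5         # 2 × 10⁻⁵, with linear warmup
    adam_beta1: float = 0.9             # AdamW β1
    adam_beta2: float = 0.95            # AdamW β2
    chunk_size: int = 3000              # "Chunk size is set to 3k" (assumed to mean tokens)
    training_steps_nlu: int = 3000      # long-context NLU fine-tuning
    training_steps_lm: int = 1000       # long-sequence LM fine-tuning

    @property
    def effective_batch_size(self) -> int:
        # The batch size the optimizer effectively sees:
        # per-device batch × accumulation steps.
        return self.per_device_batch_size * self.gradient_accumulation_steps


cfg = FineTuneConfig()
print(cfg.effective_batch_size)  # 8, matching the paper's stated batch size
```

This makes the quoted statement "the batch size is 8" explicit: it is the product of the per-device batch size (1) and the gradient accumulation steps (8).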