3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding

Authors: Xindian Ma, Wenyuan Liu, Peng Zhang, Nan Xu

AAAI 2025

Reproducibility assessment (variable: result, followed by the supporting LLM response):
Research Type: Experimental. "We have conducted experiments on long-context Natural Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially on long-context NLU tasks."
Researcher Affiliation: Collaboration. "1 College of Intelligence and Computing, Tianjin University, Tianjin, China; 2 Beijing Wenge Technology Co. EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode: No. The paper describes its methods with mathematical formulations and descriptive text; no explicit pseudocode or algorithm blocks are present.
Open Source Code: Yes. "Our code, data, and appendix are available on GitHub (https://github.com/maxindian/3D-RPE-Long-Contex-Modeling)"
Open Datasets: Yes. "In the context of long-context NLU tasks, we employ the LongAlpaca-12k dataset, which contains 9,000 long-QA and 3,000 short-QA entries (Chen et al. 2023c), and the LongAlpaca-16k-length dataset (Chen et al. 2023b). To evaluate the performance of 3D-RPE for long-context extension, we use LongBench (Bai et al. 2023)... Additionally, the LEval (An et al. 2023) evaluation set... For long-sequence LM tasks, we use the RedPajama dataset (Computer 2023) for fine-tuning... For evaluation, we utilize the PG19 book corpus dataset (Rae et al. 2020), which includes 100 documents, and the Arxiv Math Proof-pile dataset (test split)."
Dataset Splits: Yes. "For long-sequence LM tasks, we use the RedPajama dataset (Computer 2023) for fine-tuning. The dataset is a large-scale pre-training dataset (its size reaches 1.2 trillion tokens)... We sample 20,000 samples from these data sources for training. For evaluation, we utilize the PG19 book corpus dataset (Rae et al. 2020), which includes 100 documents, and the Arxiv Math Proof-pile dataset (test split)."
Hardware Specification: Yes. "Training was conducted on a single machine with 4x A800 GPUs using FlashAttention-2 (Dao 2023)."
Software Dependencies: No. The paper mentions software such as the AdamW optimizer and FlashAttention-2, and models such as LLaMA2, but it does not specify version numbers for these or for general software dependencies such as programming languages or libraries.
Experiment Setup: Yes. "The training step is 3,000. For the long-sequence Language Modeling (LM) tasks, ... The training step is 1,000. We set the per-device batch size to 1 and the gradient-accumulation step to 8, giving an effective batch size of 8. We train the model with the next-token-prediction objective using LoRA (Hu et al. 2022). We employed the AdamW optimizer (Loshchilov and Hutter 2019) with β1 = 0.9 and β2 = 0.95 for all fine-tuned models. The chunk size is set to 3k. The learning rate was set to 2 × 10^-5, and a linear learning-rate warmup was applied."
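The reported hyperparameters can be collected into a minimal sketch. This is not the authors' code: the constant names, the helper functions, and the warmup length of 100 steps are all assumptions for illustration; the paper states only the values quoted above and that a linear warmup was used.

```python
# Hypothetical sketch of the fine-tuning setup described in the paper.
# All identifiers are illustrative; only the numeric values come from the text.

PER_DEVICE_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 8
BASE_LR = 2e-5
ADAM_BETAS = (0.9, 0.95)     # AdamW beta1, beta2
TRAIN_STEPS_NLU = 3_000      # long-context NLU tasks
TRAIN_STEPS_LM = 1_000       # long-sequence LM tasks
CHUNK_SIZE = 3_000           # "chunk size is set to 3k"

def effective_batch_size(per_device: int, accum: int, num_devices: int = 1) -> int:
    """Gradient accumulation multiplies the per-device batch size."""
    return per_device * accum * num_devices

def warmup_lr(step: int, warmup_steps: int = 100, base_lr: float = BASE_LR) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold.
    The warmup length is an assumption; the paper does not report it."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

With these values, `effective_batch_size(1, 8)` reproduces the stated batch size of 8, matching the paper's note that a per-device batch of 1 with 8 accumulation steps yields a batch size of 8.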