Wavelet-based Positional Representation for Long Context

Authors: Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito

ICLR 2025

Reproducibility Assessment (variable, result, and LLM response)
Research Type: Experimental
LLM response: "Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field. ... From our experiments on extrapolation capabilities using the wikitext-103 dataset (Merity et al., 2017), the results demonstrate that our method surpasses traditional positional encoding methods in perplexity."
Researcher Affiliation: Industry
LLM response: "Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito. NTT Human Informatics Laboratories, NTT Corporation."
Pseudocode: No
LLM response: The paper describes mathematical formulas and methodological steps, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No
LLM response: "The implementation was based on the fairseq (Ott et al., 2019)-based code provided in a previous work (Press et al., 2022), and all hyperparameters were set to the same values as those in the literature (Press et al., 2022)." Footnote 6 links to https://github.com/ofirpress/attention_with_linear_biases. This link points to the code of a previous work, not the code specifically for the methodology described in this paper.
Open Datasets: Yes
LLM response: "From our experiments on extrapolation capabilities using the wikitext-103 dataset (Merity et al., 2017)... We pre-trained the Llama-2-7B model (Touvron et al., 2023b) from scratch. For pre-training, we used the Red Pajama dataset (Computer, 2023)... We used Code Parrot for evaluation..." Footnote 10 links to: https://huggingface.co/datasets/codeparrot/codeparrot-clean
Dataset Splits: No
LLM response: "The maximum allowable lengths of sequences were set to Ltrain = 512 and Ltrain = 1024. ... To evaluate sequences longer than Ltrain tokens, it is common to divide the sequence into Ltrain-length sub-sequences, evaluate each independently, and report the average score. ... we evaluated the validation set. ... For pre-training, we used the Red Pajama dataset (Computer, 2023), which selects a 1B-token sample of all samples." The paper describes evaluation strategies and sampling for pre-training, but does not provide specific training/validation/test split percentages or counts for the datasets used in model training.
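The per-chunk evaluation protocol quoted above (split a long sequence into Ltrain-length sub-sequences, score each independently, report the average) can be sketched as follows; `score_fn` is a hypothetical stand-in for a per-chunk perplexity computation, not a function from the paper:

```python
def chunked_eval(tokens, l_train, score_fn):
    """Split `tokens` into non-overlapping sub-sequences of at most
    `l_train` tokens, score each sub-sequence independently, and
    return the average score, as in the evaluation protocol above."""
    chunks = [tokens[i:i + l_train] for i in range(0, len(tokens), l_train)]
    scores = [score_fn(chunk) for chunk in chunks]
    return sum(scores) / len(scores)

# Example with a dummy scorer that returns the chunk length:
# a 10-token sequence with l_train=4 yields chunks of 4, 4, and 2 tokens.
average = chunked_eval(list(range(10)), l_train=4, score_fn=len)
```

Note that averaging per-chunk scores this way gives each chunk equal weight regardless of its length, which is one reason the protocol is only an approximation of full-sequence perplexity.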
Hardware Specification: No
LLM response: The paper does not explicitly describe the hardware (e.g., GPU models, CPU types) used for running the experiments; it only mentions "learning costs associated with large-scale language models".
Software Dependencies: No
LLM response: "The implementation was based on the fairseq (Ott et al., 2019)-based code... We used Pywavelet (Lee et al., 2019) ... We used AdamW (Loshchilov & Hutter, 2019) as the optimizer..." While software components such as fairseq, PyWavelets, and the AdamW optimizer are mentioned, specific version numbers for these software libraries are not provided.
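The paper's method builds on discrete wavelet transforms via PyWavelets (version unpinned). As a library-free illustration of the underlying operation, a one-level Haar DWT over an even-length signal is just orthonormal pairwise sums and differences; this is a generic sketch, not the paper's actual positional encoding:

```python
import math

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform for an
    even-length signal: pairwise sums give the approximation band,
    pairwise differences give the detail band, both scaled by
    1/sqrt(2) so the transform preserves the signal's energy."""
    assert len(signal) % 2 == 0, "this sketch assumes an even length"
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return approx, detail

# Decomposing a ramp of position indices: the detail band is constant,
# reflecting the ramp's uniform local slope.
approx, detail = haar_dwt([0.0, 1.0, 2.0, 3.0])
```

The multi-resolution structure this produces (coarse approximation plus local detail) is what makes wavelets attractive for representing positions at multiple scales.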
Experiment Setup: Yes
LLM response: "The dimensionality of the word embedding dmodel is 1024, the number of heads N is 8, the dimensionality of the heads d is 128, and the number of layers is 16. The number of training epochs is 205, and the batch size is 9216. The learning rate was set to 1.0, and the learning process was updated by 1e-7 every 16,000 steps." (Appendix A.7) For long-context experiments: "The dimensionality of the word embedding dmodel is 4096, the number of heads N is 32, the dimensionality of the heads d is 128, and the number of layers is 32. The number of training steps is 30,000, and the batch size is 1. The learning rate was set to 0.0003. We used AdamW (Loshchilov & Hutter, 2019) as the optimizer, with (β1, β2) = (0.9, 0.95)." (Section 7.1 and Appendix A.8)
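For reference, the reported hyperparameters can be collected into configuration sketches; the key names below are illustrative choices, while the values are the ones quoted above:

```python
# The reported hyperparameters collected into config sketches.
# Key names are illustrative; the values are those quoted above.
WIKITEXT103_CONFIG = {
    "d_model": 1024, "n_heads": 8, "d_head": 128, "n_layers": 16,
    "epochs": 205, "batch_size": 9216,
    "lr": 1.0, "lr_step": 1e-7, "lr_update_every_steps": 16_000,
}
LONG_CONTEXT_CONFIG = {  # Llama-2-7B-scale pre-training run
    "d_model": 4096, "n_heads": 32, "d_head": 128, "n_layers": 32,
    "train_steps": 30_000, "batch_size": 1,
    "lr": 3e-4, "optimizer": "AdamW", "betas": (0.9, 0.95),
}

# Sanity check: both configs satisfy d_model == n_heads * d_head,
# the standard multi-head attention layout.
for cfg in (WIKITEXT103_CONFIG, LONG_CONTEXT_CONFIG):
    assert cfg["d_model"] == cfg["n_heads"] * cfg["d_head"]
```

The consistency check is worth keeping: it confirms the two quoted head counts and head dimensions are mutually consistent with the quoted embedding sizes.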