Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary

Authors: Takashi Morita

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Nonetheless, investigations through synthetic benchmarks reveal an advantage of coupling positional encoding and RNNs, especially for handling a large vocabulary that yields low-frequency tokens. Further scrutiny unveils that these low-frequency tokens destabilize the gradients of vanilla RNNs, and positional encoding resolves this instability. These results shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers.
Researcher Affiliation | Academia | Takashi Morita (EMAIL), Academy of Emerging Sciences / Center for Mathematical Science and AI, Chubu University
Pseudocode | No | The paper describes methods in regular paragraph text and uses mathematical equations for definitions, but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All the experiments were implemented in PyTorch (ver. 2.1.1; Paszke et al., 2017; 2019) and each training/test trial was executed on a single NVIDIA A100 GPU (with 80GB VRAM) hosted by the Academic Center for Computing and Media Studies, Kyoto University. The source code is available at https://github.com/tkc-morita/position-encoded_rnn.
Open Datasets | Yes | This section reports benchmark results for the language modeling task. Single-layer LSTMs with and without sinusoidal positional encoding were trained and tested on the WikiText-103 dataset (Merity et al., 2017).
Dataset Splits | Yes | Each of the five trials held out 1024 random sequences (= 65,536 tokens) for computing the test accuracy.
Hardware Specification | Yes | All the experiments were implemented in PyTorch (ver. 2.1.1; Paszke et al., 2017; 2019) and each training/test trial was executed on a single NVIDIA A100 GPU (with 80GB VRAM) hosted by the Academic Center for Computing and Media Studies, Kyoto University.
Software Dependencies | Yes | All the experiments were implemented in PyTorch (ver. 2.1.1; Paszke et al., 2017; 2019)
Experiment Setup | Yes | The models were trained for 300,000 iterations using the Adam optimizer (Kingma & Ba, 2015) with the parameters (β1, β2) := (0.9, 0.999) and no weight decay. The learning rate was linearly warmed up from 0.0 to 0.001 for the first 1,000 iterations, and then annealed according to the cosine schedule (Loshchilov & Hutter, 2017). The batch size was 512.
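The "sinusoidal positional encoding" quoted under Research Type and Open Datasets refers to the standard sin/cos scheme. As a minimal, stdlib-only sketch (the paper's exact injection point — added to or concatenated with the embeddings — is not specified in the excerpts above, so the final comment is an assumption):

```python
import math

def sinusoidal_positions(seq_len, dim):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017):
    even dimensions get sin, odd dimensions cos, with geometrically
    spaced wavelengths from 2*pi up to 10000*2*pi."""
    pe = [[0.0] * dim for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# In a position-encoded RNN, these vectors would be combined with the
# token embeddings (e.g. added elementwise) before the recurrent layer.
```

Note that position 0 always yields the vector (0, 1, 0, 1, ...), and each coordinate varies smoothly with the position index.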
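The Dataset Splits row reports 1024 random held-out sequences (65,536 tokens, i.e. 64 tokens per sequence) per trial. A hedged sketch of such a hold-out split (the function name, seeding, and sampling mechanism are illustrative assumptions, not the authors' code):

```python
import random

def hold_out_test(sequences, n_test=1024, seed=0):
    """Hold out n_test randomly chosen sequences for testing; the rest
    are kept for training. With 64-token sequences, 1024 held-out
    sequences correspond to 65,536 test tokens, as in the paper."""
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(sequences)), n_test))
    train = [s for i, s in enumerate(sequences) if i not in test_idx]
    test = [s for i, s in enumerate(sequences) if i in test_idx]
    return train, test
```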
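The Experiment Setup row describes linear warmup from 0.0 to 0.001 over 1,000 iterations followed by cosine annealing over the 300,000-iteration run. A minimal sketch of that schedule, assuming the cosine phase decays to zero at the final iteration (the exact endpoint is not stated in the excerpt):

```python
import math

WARMUP = 1_000      # warmup iterations (from the paper)
TOTAL = 300_000     # total training iterations (from the paper)
PEAK = 1e-3         # peak learning rate (from the paper)

def learning_rate(step):
    """Linear warmup 0.0 -> PEAK over WARMUP steps, then cosine
    annealing PEAK -> 0.0 over the remaining steps (assumed endpoint)."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this is commonly assembled from `torch.optim.lr_scheduler.LambdaLR` or a warmup wrapper around `CosineAnnealingLR`; the closed form above makes the shape of the schedule explicit.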