Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary
Authors: Takashi Morita
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Nonetheless, investigations through synthetic benchmarks reveal an advantage of coupling positional encoding and RNNs, especially for handling a large vocabulary that yields low-frequency tokens. Further scrutinization unveils that these low-frequency tokens destabilize the gradients of vanilla RNNs, and the positional encoding resolves this instability. These results shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers. |
| Researcher Affiliation | Academia | Takashi Morita, Academy of Emerging Sciences / Center for Mathematical Science and AI, Chubu University |
| Pseudocode | No | The paper describes methods in regular paragraph text and uses mathematical equations for definitions, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the experiments were implemented in PyTorch (ver. 2.1.1; Paszke et al., 2017; 2019) and each training-test trial was executed on a single NVIDIA A100 GPU (with 80GB VRAM) hosted by the Academic Center for Computing and Media Studies, Kyoto University. The source code is available at https://github.com/tkc-morita/position-encoded_rnn. |
| Open Datasets | Yes | This section reports benchmark results for the language modeling task. Single-layer LSTMs with and without sinusoidal positional encoding were trained and tested on the WikiText-103 dataset (Merity et al., 2017). |
| Dataset Splits | Yes | Each of the five trials held out 1024 random sequences (= 65,536 tokens) for computing the test accuracy. |
| Hardware Specification | Yes | All the experiments were implemented in PyTorch (ver. 2.1.1; Paszke et al., 2017; 2019) and each training-test trial was executed on a single NVIDIA A100 GPU (with 80GB VRAM) hosted by the Academic Center for Computing and Media Studies, Kyoto University. |
| Software Dependencies | Yes | All the experiments were implemented in PyTorch (ver. 2.1.1; Paszke et al., 2017; 2019) |
| Experiment Setup | Yes | The models were trained for 300,000 iterations using the Adam optimizer (Kingma & Ba, 2015) with the parameters (β1, β2) := (0.9, 0.999) and no weight decay. The learning rate was linearly warmed up from 0.0 to 0.001 for the first 1,000 iterations, and then annealed according to the cosine schedule (Loshchilov & Hutter, 2017). The batch size was 512. |
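The paper's central setup, a single-layer LSTM whose input embeddings are augmented with sinusoidal positional encoding, can be sketched as follows. This is a minimal illustration, not the authors' released implementation (see their repository for that); the model dimensions, the additive combination of embedding and encoding, and the class/parameter names here are all assumptions.

```python
import math
import torch
import torch.nn as nn

class PositionEncodedLSTM(nn.Module):
    """Illustrative sketch: single-layer LSTM language model whose token
    embeddings are summed with the sinusoidal positional encoding of
    Vaswani et al. (2017). Dimensions and names are assumptions, not
    the paper's actual configuration."""

    def __init__(self, vocab_size: int, d_model: int = 512, max_len: int = 4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Precompute the sinusoidal table:
        #   pe[t, 2i]   = sin(t / 10000^(2i/d_model))
        #   pe[t, 2i+1] = cos(t / 10000^(2i/d_model))
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)  # not a trainable parameter
        self.lstm = nn.LSTM(d_model, d_model, num_layers=1, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) -> logits: (batch, seq_len, vocab_size)
        x = self.embed(tokens) + self.pe[: tokens.size(1)]
        h, _ = self.lstm(x)
        return self.out(h)
```

Dropping the `+ self.pe[...]` term recovers the vanilla-LSTM baseline, which is the comparison the paper's benchmarks are built around.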
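The learning-rate schedule quoted in the Experiment Setup row (linear warmup from 0.0 to 0.001 over the first 1,000 of 300,000 iterations, then cosine annealing) can be written as a small closed-form function. A sketch under stated assumptions: the paper does not quote the annealing endpoint, so decaying to zero is an assumption here.

```python
import math

def lr_at(step: int, peak_lr: float = 0.001,
          warmup: int = 1_000, total: int = 300_000) -> float:
    """Warmup + cosine schedule as described in the quoted setup:
    linear warmup from 0 to peak_lr over `warmup` iterations, then
    cosine annealing (Loshchilov & Hutter, 2017). Decay to 0 at
    `total` iterations is an assumption, not stated in the excerpt."""
    if step < warmup:
        return peak_lr * step / warmup  # linear warmup from 0.0
    # Cosine decay over the remaining iterations.
    progress = (step - warmup) / (total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice the same shape is obtained by chaining PyTorch's built-in `torch.optim.lr_scheduler.LinearLR` and `CosineAnnealingLR` schedulers around the Adam optimizer with `(β1, β2) = (0.9, 0.999)`.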