Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Authors: Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Zhou Xun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs. We conduct an ablation study based on the setup of OLMo E-1.3B. |
| Researcher Affiliation | Industry | Bytedance Seed. Correspondence to: Hongzhi Huang <EMAIL>. |
| Pseudocode | Yes | Appendix A, "PyTorch Implementation": We provide PyTorch-like pseudocode for Over-Encoding in Algorithm 1. |
| Open Source Code | No | The paper includes pseudocode in Appendix A but does not provide a specific link to a code repository or an explicit statement about releasing runnable source code for the described methodology. |
| Open Datasets | Yes | We report the average perplexities (PPL) and losses on the c4 en-validation dataset... The result is plotted in Figure 1. From comparisons across model scales, OE models hold consistent improvements over baseline models... We follow the experimental setup described in OLMo2 (OLMo et al., 2024)... We present a detailed training dynamics comparison for models on OLMo2-1B in Figure 11. OE achieves significant improvements across most metrics... Detailed descriptions of these tasks and more comprehensive evaluations can be found in Appendix B. For instance, 'piqa (Bisk et al., 2020)', 'hellaswag (Zellers et al., 2019)', 'arc easy (Clark et al., 2018)', 'arc challenge (Clark et al., 2018)', 'mmlu (Hendrycks et al., 2021)' are used. |
| Dataset Splits | Yes | We train OLMo2-151M and OLMo2-400M models with 400B tokens, and OLMo2-1B with 1T tokens... We report the average perplexities (PPL) and losses on the c4 en-validation dataset as the Eval PPL or Eval Loss... 20 million sentences are sampled from the grammar, serving as a fixed training dataset. To evaluate model performance, we sample 10,000 sentences from the trained model auto-regressively using the raw next-token probability. The sentences are then verified by the ground-truth grammar, and the ratio of accurate samples is denoted as generation accuracy. |
| Hardware Specification | Yes | Table 6: Training throughputs for OE and baseline. We report average tokens per second in millions. Hardware: 32 A100 GPUs for OLMo E-1.3B; 64 A100 GPUs for OLMo E-7B. For inference efficiency, we test the prefill and decoding throughput on a single A100 GPU using the transformers library. |
| Software Dependencies | No | The paper mentions using the 'transformers library' and 'PyTorch-like pseudocode' but does not specify version numbers for these or any other key software components required for replication. |
| Experiment Setup | Yes | We train OLMo2-151M and OLMo2-400M models with 400B tokens, and OLMo2-1B with 1T tokens... We use the AdamW optimizer with beta = (0.9, 0.98), weight decay 0.1, initial learning rate 3e-4 and batch size 64 * 8. The models are trained with a cosine learning rate scheduler for 10 epochs. |
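The log-linear relationship reported under Research Type says training loss decreases linearly in the logarithm of the input vocabulary size. A minimal sketch of how such a trend would be fitted is below; the `(vocab_size, loss)` values are purely illustrative placeholders, not numbers from the paper.

```python
import numpy as np

# Hypothetical (input vocab size, training loss) pairs illustrating a
# log-linear trend; these values are NOT taken from the paper.
vocab_sizes = np.array([12_800, 51_200, 204_800, 819_200, 3_276_800])
train_loss = np.array([2.74, 2.68, 2.62, 2.56, 2.50])

# Fit loss = a + b * log(V). A negative slope b corresponds to the
# paper's claim that larger input vocabularies consistently lower loss.
b, a = np.polyfit(np.log(vocab_sizes), train_loss, 1)
predicted = a + b * np.log(vocab_sizes)
```

Under such a fit, doubling the input vocabulary buys a fixed absolute loss reduction of `b * log(2)`, which is what makes vocabulary a scaling axis comparable to model size.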
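The paper's runnable artifact is only the PyTorch-like pseudocode for Over-Encoding in Appendix A, which this review did not reproduce. As a rough orientation for readers, here is a self-contained numpy sketch of the general idea of an over-encoded input layer: augmenting each token's ordinary embedding with a hashed n-gram embedding drawn from a much larger table. The table sizes, the modular hash, and the bigram construction are all assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the paper scales the input vocabulary far larger.
BASE_VOCAB = 50_000    # ordinary 1-gram (e.g. BPE) vocabulary
NGRAM_SLOTS = 200_000  # hashed slots standing in for the enlarged vocabulary
DIM = 64

one_gram_emb = rng.normal(0.0, 0.02, size=(BASE_VOCAB, DIM))
two_gram_emb = rng.normal(0.0, 0.02, size=(NGRAM_SLOTS, DIM))

def over_encode(token_ids):
    """Embed each position as its 1-gram embedding plus a hashed
    2-gram embedding of (previous token, current token)."""
    ids = np.asarray(token_ids)
    out = one_gram_emb[ids]
    prev = np.concatenate(([0], ids[:-1]))  # pad the first position
    # Modular hash maps the huge implicit bigram id space into the table.
    bigram_ids = (prev * BASE_VOCAB + ids) % NGRAM_SLOTS
    return out + two_gram_emb[bigram_ids]

x = over_encode([17, 523, 9841])  # shape (3, DIM)
```

Because only embedding lookups change, the decoder and attention stack are untouched, which is consistent with the paper's claim of matching larger baselines "with no additional cost" at the compute level.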