Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Authors: Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Zhou Xun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs. We conduct an ablation study based on the setup of OLMo E-1.3B. |
| Researcher Affiliation | Industry | Bytedance Seed. Correspondence to: Hongzhi Huang <EMAIL>. |
| Pseudocode | Yes | Appendix A, "PyTorch Implementation": We provide PyTorch-like pseudocode for Over-Encoding in Algorithm 1. |
| Open Source Code | No | The paper includes pseudocode in Appendix A but does not provide a specific link to a code repository or an explicit statement about releasing runnable source code for the described methodology. |
| Open Datasets | Yes | We report the average perplexities (PPL) and losses on the c4 en-validation dataset... The result is plotted in Figure 1. From comparisons across model scales, OE models hold consistent improvements over baseline models... We follow the experimental setup described in OLMo2 (OLMo et al., 2024)... We present a detailed training dynamics comparison for models on OLMo2-1B in Figure 11. OE achieves significant improvements across most metrics... Detailed descriptions of these tasks and more comprehensive evaluations can be found in Appendix B. For instance, 'piqa (Bisk et al., 2020)', 'hellaswag (Zellers et al., 2019)', 'arc easy (Clark et al., 2018)', 'arc challenge (Clark et al., 2018)', 'mmlu (Hendrycks et al., 2021)' are used. |
| Dataset Splits | Yes | We train OLMo2-151M and OLMo2-400M models with 400B tokens, and OLMo2-1B with 1T tokens... We report the average perplexities (PPL) and losses on the c4 en-validation dataset as the Eval PPL or Eval Loss... 20 million sentences are sampled from the grammar, serving as a fixed training dataset. To evaluate model performance, we sample 10,000 sentences from the trained model auto-regressively using the raw next-token probability. The sentences are then verified by the ground-truth grammar, and the ratio of accurate samples is denoted as generation accuracy. |
| Hardware Specification | Yes | Table 6: Training throughputs for OE and baseline. We report average tokens per second in millions. Hardware: 32 A100 GPUs for OLMo E-1.3B; 64 A100 GPUs for OLMo E-7B. For inference efficiency, we test the prefill and decoding throughput on a single A100 GPU using the transformers library. |
| Software Dependencies | No | The paper mentions using the 'transformers library' and 'PyTorch-like pseudocode' but does not specify version numbers for these or any other key software components required for replication. |
| Experiment Setup | Yes | We train OLMo2-151M and OLMo2-400M models with 400B tokens, and OLMo2-1B with 1T tokens... We use the AdamW optimizer with beta = (0.9, 0.98), weight decay 0.1, initial learning rate 3e-4 and batch size 64 * 8. The models are trained with a cosine learning rate scheduler for 10 epochs. |
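The log-linear relationship reported under Research Type says training loss decreases linearly in the logarithm of the input vocabulary size. A minimal sketch of how such a trend would be fitted is below; the `(vocab_size, loss)` values are purely illustrative placeholders, not numbers from the paper.

```python
import numpy as np

# Hypothetical (input vocab size, training loss) pairs illustrating a
# log-linear trend; these values are NOT taken from the paper.
vocab_sizes = np.array([12_800, 51_200, 204_800, 819_200, 3_276_800])
train_loss = np.array([2.74, 2.68, 2.62, 2.56, 2.50])

# Fit loss = a + b * log(V). A negative slope b corresponds to the
# paper's claim that larger input vocabularies consistently lower loss.
b, a = np.polyfit(np.log(vocab_sizes), train_loss, 1)
predicted = a + b * np.log(vocab_sizes)
```

Under such a fit, doubling the input vocabulary buys a fixed absolute loss reduction of `b * log(2)`, which is what makes vocabulary a scaling axis comparable to model size.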
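The paper's runnable artifact is only the PyTorch-like pseudocode for Over-Encoding in Appendix A, which this review did not reproduce. As a rough orientation for readers, here is a self-contained numpy sketch of the general idea of an over-encoded input layer: augmenting each token's ordinary embedding with a hashed n-gram embedding drawn from a much larger table. The table sizes, the modular hash, and the bigram construction are all assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the paper scales the input vocabulary far larger.
BASE_VOCAB = 50_000    # ordinary 1-gram (e.g. BPE) vocabulary
NGRAM_SLOTS = 200_000  # hashed slots standing in for the enlarged vocabulary
DIM = 64

one_gram_emb = rng.normal(0.0, 0.02, size=(BASE_VOCAB, DIM))
two_gram_emb = rng.normal(0.0, 0.02, size=(NGRAM_SLOTS, DIM))

def over_encode(token_ids):
    """Embed each position as its 1-gram embedding plus a hashed
    2-gram embedding of (previous token, current token)."""
    ids = np.asarray(token_ids)
    out = one_gram_emb[ids]
    prev = np.concatenate(([0], ids[:-1]))  # pad the first position
    # Modular hash maps the huge implicit bigram id space into the table.
    bigram_ids = (prev * BASE_VOCAB + ids) % NGRAM_SLOTS
    return out + two_gram_emb[bigram_ids]

x = over_encode([17, 523, 9841])  # shape (3, DIM)
```

Because only embedding lookups change, the decoder and attention stack are untouched, which is consistent with the paper's claim of matching larger baselines "with no additional cost" at the compute level.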