Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate VocAgnoLM by continual pretraining TinyLlama 1.1B (Zhang et al., 2024a) using 7B teacher models built on a different vocabulary system, such as Mistral (Jiang et al., 2023), DeepSeek (DeepSeek-AI et al., 2024), or Qwen2.5 (Qwen et al., 2024). Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling. Table 1: Performance Comparison of Student Model (S) Guided by Various Teacher Models.
Researcher Affiliation | Collaboration | 1 Microsoft Research, 2 KAIST AI. Correspondence to: Haebin Shin <EMAIL>, Lei Ji <EMAIL>, Xiao Liu <EMAIL>, Yeyun Gong <EMAIL>.
Pseudocode | Yes | Algorithm 1: Token-level Lexical Alignment
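The excerpt names Algorithm 1 (Token-level Lexical Alignment) but does not reproduce it. Purely as an illustration of the general idea of aligning tokens across two tokenizers with different vocabularies, the sketch below matches each student token to the teacher tokens whose character spans overlap it over the same underlying text; the span representation, the `align_tokens` helper, and the toy tokenizations are assumptions, not the paper's actual algorithm.

```python
def align_tokens(student_spans, teacher_spans):
    """Map each student token to the teacher tokens whose character spans
    overlap it. Each span is a (token, start, end) triple over the same text.
    Illustrative sketch only; the paper's Algorithm 1 may differ in detail."""
    alignment = []
    for s_tok, s_start, s_end in student_spans:
        # Two half-open intervals [a, b) and [c, d) overlap iff a < d and c < b.
        overlapping = [t_tok for t_tok, t_start, t_end in teacher_spans
                       if t_start < s_end and s_start < t_end]
        alignment.append((s_tok, overlapping))
    return alignment

# Hypothetical example: two tokenizers splitting the word "vocabulary" differently.
student = [("voc", 0, 3), ("abulary", 3, 10)]
teacher = [("vocab", 0, 5), ("ulary", 5, 10)]
print(align_tokens(student, teacher))
# -> [('voc', ['vocab']), ('abulary', ['vocab', 'ulary'])]
```

Character offsets of this kind are available from common tokenizer libraries (e.g., as offset mappings), which makes span-overlap alignment cheap to compute during training.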
Open Source Code | No | The paper discusses the use of LitGPT (Lightning-AI, 2023) as a tool for continual pretraining, with a link to its repository, but it provides no explicit statement or link indicating that the VocAgnoLM implementation itself is open-sourced.
Open Datasets | Yes | We utilize OpenWebMath (Paster et al., 2024), containing about 15 billion tokens sourced from math-related web pages in the Common Crawl.
Dataset Splits | No | The paper mentions using OpenWebMath (Paster et al., 2024) for pretraining and lists several mathematical reasoning benchmarks for evaluation (e.g., GSM8K, MATH), but it does not explicitly specify the training, validation, or test splits used for these datasets. It refers to 'few-shot chain-of-thought (CoT) examples following the settings in Lin et al. (2024); Zhou et al. (2024)', implying external references, but no direct split details are provided in the main text.
Hardware Specification | Yes | Training is conducted on 32 H100 GPUs with a cosine learning rate scheduler (decaying from 8e-5 to 8e-6), a sequence length of 2048, and a global batch size of 2M tokens, following prior works (Zhang et al., 2024a; Lin et al., 2024; Zhou et al., 2024).
Software Dependencies | No | The paper states 'We use LitGPT (Lightning-AI, 2023) to continually pretrain...', naming a specific tool but providing neither its version number nor any other software dependencies with specific versions.
Experiment Setup | Yes | Training is conducted on 32 H100 GPUs with a cosine learning rate scheduler (decaying from 8e-5 to 8e-6), a sequence length of 2048, and a global batch size of 2M tokens, following prior works (Zhang et al., 2024a; Lin et al., 2024; Zhou et al., 2024). We apply a top-k threshold of 40%. Details are described in Appendix B.
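The cosine schedule described in this row (decaying from 8e-5 to 8e-6) can be sketched as a standalone function. The excerpt does not specify warmup or the total step count, so both are left out or parameterized here as assumptions.

```python
import math

def cosine_lr(step, total_steps, lr_max=8e-5, lr_min=8e-6):
    """Cosine decay from lr_max to lr_min over total_steps, matching the
    endpoints reported in the paper (8e-5 -> 8e-6). Any warmup phase is
    omitted because the excerpt does not describe one."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Endpoints of the schedule (total_steps = 1000 is an arbitrary placeholder):
print(cosine_lr(0, 1000))     # 8e-5 at the start
print(cosine_lr(1000, 1000))  # 8e-6 at the end
```

In practice the same shape is available off the shelf, e.g. via PyTorch's `CosineAnnealingLR` with `eta_min=8e-6`.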