Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate VocAgnoLM by continual pretraining TinyLlama 1.1B (Zhang et al., 2024a) using 7B teacher models built on a different vocabulary system, such as Mistral (Jiang et al., 2023), DeepSeek (DeepSeek-AI et al., 2024), or Qwen2.5 (Qwen et al., 2024). Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling. Table 1: Performance Comparison of Student Model (S) Guided by Various Teacher Models.
Researcher Affiliation | Collaboration | 1 Microsoft Research, 2 KAIST AI. Correspondence to: Haebin Shin <EMAIL>, Lei Ji <EMAIL>, Xiao Liu <EMAIL>, Yeyun Gong <EMAIL>.
Pseudocode | Yes | Algorithm 1: Token-level Lexical Alignment
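The excerpt names Algorithm 1 (Token-level Lexical Alignment) but does not reproduce it. Purely as an illustration of the general idea of aligning tokens across two tokenizers with different vocabularies, the sketch below matches each student token to the teacher tokens whose character spans overlap it over the same underlying text; the span representation, the `align_tokens` helper, and the toy tokenizations are assumptions, not the paper's actual algorithm.

```python
def align_tokens(student_spans, teacher_spans):
    """Map each student token to the teacher tokens whose character spans
    overlap it. Each span is a (token, start, end) triple over the same text.
    Illustrative sketch only; the paper's Algorithm 1 may differ in detail."""
    alignment = []
    for s_tok, s_start, s_end in student_spans:
        # Two half-open intervals [a, b) and [c, d) overlap iff a < d and c < b.
        overlapping = [t_tok for t_tok, t_start, t_end in teacher_spans
                       if t_start < s_end and s_start < t_end]
        alignment.append((s_tok, overlapping))
    return alignment

# Hypothetical example: two tokenizers splitting the word "vocabulary" differently.
student = [("voc", 0, 3), ("abulary", 3, 10)]
teacher = [("vocab", 0, 5), ("ulary", 5, 10)]
print(align_tokens(student, teacher))
# -> [('voc', ['vocab']), ('abulary', ['vocab', 'ulary'])]
```

Character offsets of this kind are available from common tokenizer libraries (e.g., as offset mappings), which makes span-overlap alignment cheap to compute during training.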
Open Source Code | No | The paper discusses the use of LitGPT (Lightning-AI, 2023) as a tool for continual pretraining, with a link to its repository, but it provides no explicit statement or link indicating that the VocAgnoLM implementation itself is open-sourced.
Open Datasets | Yes | We utilize OpenWebMath (Paster et al., 2024), containing about 15 billion tokens sourced from math-related web pages in the Common Crawl.
Dataset Splits | No | The paper mentions using OpenWebMath (Paster et al., 2024) for pretraining and lists several mathematical reasoning benchmarks for evaluation (e.g., GSM8K, MATH), but it does not explicitly specify the training, validation, or test splits used for these datasets. It refers to 'few-shot chain-of-thought (CoT) examples following the settings in Lin et al. (2024); Zhou et al. (2024)', implying external references, but no direct split details are provided in the main text.
Hardware Specification | Yes | Training is conducted on 32 H100 GPUs with a cosine learning rate scheduler (decaying from 8e-5 to 8e-6), a sequence length of 2048, and a global batch size of 2M tokens, following prior works (Zhang et al., 2024a; Lin et al., 2024; Zhou et al., 2024).
Software Dependencies | No | The paper states 'We use LitGPT (Lightning-AI, 2023) to continually pretrain...', naming a specific tool but providing neither its version number nor any other software dependencies with specific versions.
Experiment Setup | Yes | Training is conducted on 32 H100 GPUs with a cosine learning rate scheduler (decaying from 8e-5 to 8e-6), a sequence length of 2048, and a global batch size of 2M tokens, following prior works (Zhang et al., 2024a; Lin et al., 2024; Zhou et al., 2024). We apply a top-k threshold of 40%. Details are described in Appendix B.
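The cosine schedule described in this row (decaying from 8e-5 to 8e-6) can be sketched as a standalone function. The excerpt does not specify warmup or the total step count, so both are left out or parameterized here as assumptions.

```python
import math

def cosine_lr(step, total_steps, lr_max=8e-5, lr_min=8e-6):
    """Cosine decay from lr_max to lr_min over total_steps, matching the
    endpoints reported in the paper (8e-5 -> 8e-6). Any warmup phase is
    omitted because the excerpt does not describe one."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Endpoints of the schedule (total_steps = 1000 is an arbitrary placeholder):
print(cosine_lr(0, 1000))     # 8e-5 at the start
print(cosine_lr(1000, 1000))  # 8e-6 at the end
```

In practice the same shape is available off the shelf, e.g. via PyTorch's `CosineAnnealingLR` with `eta_min=8e-6`.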