Language Models over Canonical Byte-Pair Encodings

Authors: Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian Dusell, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O’Donnell, Ryan Cotterell

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora. From Section 5 (Experiments): This section evaluates our proposed methods, canonicality by constraints (global and local; §3) and canonicality by conditioning (§4), by measuring their impact on real datasets and language models.
Researcher Affiliation | Academia | 1 ETH Zürich, 2 Mila, 3 McGill University, 4 Canada CIFAR AI Chair. Correspondence to: Tim Vieira <EMAIL>.
Pseudocode | Yes |
    def rejection_sampling():
        while True:
            δ ← sample(p)
            if δ ∈ D: return δ
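The rejection-sampling pseudocode above can be sketched as runnable Python. This is a minimal sketch: `sample_p` and `in_D` are hypothetical stand-ins for the paper's proposal distribution and its canonical-set membership check, and the toy usage below substitutes a uniform integer distribution with the even numbers as the accepted set.

```python
import random

def rejection_sampling(sample_p, in_D, max_tries=10_000):
    """Draw from p restricted to a set D by rejection sampling:
    repeatedly sample from p and return the first draw that lies in D.

    sample_p: zero-argument callable producing one draw from p
    in_D:     predicate that is True iff a draw belongs to D
    """
    for _ in range(max_tries):
        delta = sample_p()
        if in_D(delta):
            return delta
    raise RuntimeError("no accepted sample within max_tries")

# Toy illustration (hypothetical stand-ins for the paper's p and D):
# p is uniform over 0..9, and D is the set of even numbers.
draw = rejection_sampling(lambda: random.randrange(10),
                          lambda d: d % 2 == 0)
```

The accepted draw is distributed as p conditioned on membership in D, which is the property the paper's canonicality-by-conditioning construction relies on.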
Open Source Code | Yes | github.com/genlm/canonical-icml-2025
Open Datasets | Yes | Penn Treebank (PTB; Marcus et al., 1993): test split, 3,761 strings, 82k words, 439k characters. WikiText (Merity et al., 2017): test split, 4,358 strings, 234k words, 1,286k characters.
Dataset Splits | Yes | PTB test split (3,761 strings, 82k words, 439k characters); WikiText test split (4,358 strings, 234k words, 1,286k characters). We fine-tuned two language models, GPT-2S and GPT-2M, on the PTB train set and a subset of the WikiText train set with 50K strings and 4.2M words.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models) are provided. The mention of 'bfloat16' refers to a data type used for model parameters, not a hardware specification.
Software Dependencies | No | The paper mentions the AdamW optimizer but does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | We fine-tuned two language models, GPT-2S and GPT-2M, on the PTB train set and a subset of the WikiText train set with 50K strings and 4.2M words. We consider fine-tuning the canonicalized architecture (ℓθ) and the original architecture (pθ) using the training criterion Fλ for λ ∈ {0.001, 0.01, 0.1, 0.2}. Each model is trained for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5 and linear learning-rate decay. For efficiency, we use bfloat16 to represent the model parameters. We use a minibatch of size 8 for estimating the gradient of each term of the Fλ objective.
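The linear learning-rate decay in the setup above can be sketched as a small schedule function. This is an assumption-laden sketch: the paper states only "learning rate of 5e-5 and linear learning rate decay", so the sketch assumes decay to zero over the full run with no warmup.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Linearly anneal the learning rate from base_lr down to 0 over
    total_steps, per the reported schedule (AdamW with linear decay).
    Assumes no warmup and a final rate of 0, which the paper does not
    specify explicitly."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# The full rate is used at the start of training and reaches 0 at the end.
start_lr = linear_decay_lr(0, 1000)
mid_lr = linear_decay_lr(500, 1000)
end_lr = linear_decay_lr(1000, 1000)
```

In a framework-specific run this would typically be wired in via the optimizer's scheduler (e.g., a lambda-based LR scheduler) rather than called by hand.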