Language Models over Canonical Byte-Pair Encodings
Authors: Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian Dusell, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O’Donnell, Ryan Cotterell
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora. 5. Experiments: This section evaluates our proposed methods, canonicality by constraints (global and local; §3) and canonicality by conditioning (§4), by measuring their impact on real datasets and language models. |
| Researcher Affiliation | Academia | 1ETH Zürich 2Mila 3McGill University 4Canada CIFAR AI Chair. Correspondence to: Tim Vieira <EMAIL>. |
| Pseudocode | Yes | `def rejection_sampling(): while True: δ ← sample(p); if δ ∈ D: return δ` |
| Open Source Code | Yes | github.com/genlm/canonical-icml-2025 |
| Open Datasets | Yes | Penn Treebank (PTB; Marcus et al., 1993) (test split; 3,761 strings, 82k words, 439k characters); WikiText (Merity et al., 2017) (test split; 4,358 strings, 234k words, 1,286k characters) |
| Dataset Splits | Yes | Penn Treebank (PTB; Marcus et al., 1993) (test split; 3,761 strings, 82k words, 439k characters); WikiText (Merity et al., 2017) (test split; 4,358 strings, 234k words, 1,286k characters). We fine-tuned two language models, GPT-2S and GPT-2M, on the PTB train set and a subset of the WikiText train set with 50K strings and 4.2M words. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models) are provided. The mention of 'bfloat16' refers to a data type used for model parameters, not a hardware specification. |
| Software Dependencies | No | The paper mentions the AdamW optimizer but does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | We fine-tuned two language models, GPT-2S and GPT-2M, on the PTB train set and a subset of the WikiText train set with 50K strings and 4.2M words. We consider fine-tuning the canonicalized architecture (ℓθ) and the original architecture (pθ) using the training criterion Fλ for λ ∈ {0.001, 0.01, 0.1, 0.2}. Each model is trained for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5 and linear learning rate decay. For efficiency, we use bfloat16 to represent the model parameters. We use a minibatch of size 8 for estimating the gradient of each term of the Fλ objective. |
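The rejection-sampling pseudocode quoted in the table can be sketched as a runnable loop. This is a minimal illustration, not the paper's implementation: `sample_sequence` and `is_canonical` are hypothetical stand-ins for the language model's sampler and the BPE canonicality check, and the toy "canonical set" here is just an explicit collection of accepted token sequences.

```python
import random

def sample_sequence(vocab, length, rng):
    """Toy proposal distribution: draw a random token sequence.
    Stands in for sampling from the base language model p."""
    return tuple(rng.choice(vocab) for _ in range(length))

def is_canonical(seq, canonical_set):
    """Toy canonicality check (hypothetical): membership in a
    precomputed set of canonical tokenizations, standing in for
    the test 'delta in D' from the pseudocode."""
    return seq in canonical_set

def rejection_sampling(vocab, length, canonical_set, rng):
    """Repeat: draw delta from the proposal; accept and return it
    only once it passes the canonicality check."""
    while True:
        delta = sample_sequence(vocab, length, rng)
        if is_canonical(delta, canonical_set):
            return delta

rng = random.Random(0)
vocab = ["he", "llo", "wor", "ld"]
canonical = {("he", "llo"), ("wor", "ld")}
sample = rejection_sampling(vocab, 2, canonical, rng)
print(sample)
```

Every returned sequence is canonical by construction; the cost is the expected number of proposals per accepted sample, which grows as the proposal places less mass on the canonical set.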