From Language Models over Tokens to Language Models over Characters
Authors: Tim Vieira, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Brian Dusell, John Terilla, Timothy J. O’Donnell, Ryan Cotterell
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that even with a small computation budget our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved. |
| Researcher Affiliation | Academia | ¹ETH Zürich, ²Mila, ³City University of New York, ⁴McGill University, ⁵Canada CIFAR AI Chair. Correspondence to: Tim Vieira <EMAIL>. |
| Pseudocode | Yes | def conditional_token_generation(σ): while True: sample δ ~ p; if κ(δ) ⪰ σ: return δ  # accept |
| Open Source Code | Yes | https://github.com/genlm/genlm-bytes |
| Open Datasets | Yes | We use the wikitext-103-v1 corpus as a source of character strings; we used the version in the datasets library. |
| Dataset Splits | No | We use the wikitext-103-v1 corpus as a source of character strings; we used the version in the datasets library. Specifically, we use the test portion. While the paper states that it uses the 'test portion', it does not specify how the train/validation/test splits were created or whether standard splits were used (e.g., percentages, counts, or a citation to a specific split configuration). |
| Hardware Specification | Yes | Experiments were run on an L40S GPU with 40GB of memory. |
| Software Dependencies | No | We use the following publicly available models: Llama-3.2-1B, Meta-Llama-3.1-8B, DeepSeek-R1-Distill-Llama-8B, and phi-4 (14B) from the transformers library (Wolf et al., 2020). We use the vllm library (Kwon et al., 2023) backend to perform the efficient, batched evaluation of transformer language models on GPUs. No specific version numbers are provided for the transformers or vllm libraries, only the years of their corresponding papers. |
| Experiment Setup | Yes | We measure the approximation error as the average Jensen–Shannon distance (JSD) to a reference model's conditional distribution over the next byte (Fig. 1a). We use a large beam K = 128 as a reference model. We evaluate the average surprisal (−log2 probability) of our model's estimated conditional distribution over the next byte in the corpus. As a baseline, we use the average surprisal (bits/byte) of the canonical tokenization under the token-level language model (Fig. 1b). Error is computed between the character-level conditional distributions with beam sizes K ∈ {2, 4, 8, 16, 32, 64} and a reference distribution computed using a much larger value of K = 128. |
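The pseudocode quoted in the table is a rejection-sampling loop: draw a token sequence δ from the token-level model p and accept it only if its character decoding κ(δ) is consistent with the conditioning string σ. A minimal sketch of that loop follows, with toy stand-ins for p and κ (the names `sample_p` and `kappa` and the tiny vocabulary are illustrative assumptions, not the paper's implementation):

```python
import random

def kappa(delta):
    # Toy decoding map: concatenate tokens into a character string.
    return "".join(delta)

def sample_p():
    # Toy stand-in for sampling a token sequence from the token-level model p.
    vocab = ["he", "hello", "help", "world"]
    return [random.choice(vocab)]

def conditional_token_generation(sigma, max_tries=10_000):
    # Rejection sampling: keep drawing delta ~ p until kappa(delta)
    # is consistent with (here: extends) the conditioning string sigma.
    for _ in range(max_tries):
        delta = sample_p()
        if kappa(delta).startswith(sigma):
            return delta  # accept
    raise RuntimeError("no sample accepted within the budget")

random.seed(0)
print(conditional_token_generation("hel"))
```

The acceptance test `kappa(delta).startswith(sigma)` plays the role of the κ(δ) ⪰ σ check in the quoted pseudocode.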
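The Experiment Setup row names two quantities: the Jensen–Shannon distance between a small-beam and a reference (K = 128) next-byte distribution, and average surprisal in bits/byte. A self-contained sketch of both computations, assuming the distributions are given as aligned probability vectors over the same byte support (helper names are ours, not the paper's):

```python
import math

def jsd(p, q):
    # Jensen-Shannon distance (base 2): square root of the JS divergence
    # between two distributions given as aligned probability lists.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence in bits; terms with a_i = 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def avg_surprisal(probs):
    # Average surprisal (bits/byte) of the probabilities a model assigned
    # to the bytes actually observed in the corpus.
    return -sum(math.log2(p) for p in probs) / len(probs)

print(jsd([0.5, 0.5], [0.5, 0.5]))   # identical distributions
print(avg_surprisal([0.5, 0.5]))     # bits per byte
```

With base-2 logarithms the JSD is bounded in [0, 1], which makes it a convenient scale for the approximation-error curves in Fig. 1a.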