From Language Models over Tokens to Language Models over Characters

Authors: Tim Vieira, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell

ICML 2025

Reproducibility (Variable: Result, followed by the supporting LLM response)
Research Type: Experimental
    "In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that even with a small computation budget our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved."
Researcher Affiliation: Academia
    "1 ETH Zürich, 2 Mila, 3 City University of New York, 4 McGill University, 5 Canada CIFAR AI Chair. Correspondence to: Tim Vieira <EMAIL>."
Pseudocode: Yes
    def conditional_token_generation(σ):
        while True:
            sample δ ~ p
            if κ(δ) ⪰ σ: return δ  # accept if the detokenization κ(δ) extends the prompt σ
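The pseudocode above can be sketched as runnable Python. This is a toy illustration, not the authors' implementation: `sample_p` (a unigram token model over a made-up vocabulary) and `kappa` (plain concatenation) are stand-ins for the paper's token-level language model p and detokenization map κ, and the accept condition is realized as a string-prefix check.

```python
import random

TOKENS = ["he", "hel", "lo", "world", "x"]
WEIGHTS = [0.3, 0.2, 0.2, 0.2, 0.1]

def sample_p(rng, max_len=3):
    """Toy prior over token strings: a few i.i.d. unigram draws."""
    n = rng.randint(1, max_len)
    return tuple(rng.choices(TOKENS, weights=WEIGHTS, k=n))

def kappa(delta):
    """Detokenize: map a token string to its character string."""
    return "".join(delta)

def conditional_token_generation(sigma, rng):
    """Rejection sampling: draw token strings from the prior until the
    detokenized characters extend the character prompt sigma."""
    while True:
        delta = sample_p(rng)
        if kappa(delta).startswith(sigma):  # accept
            return delta

rng = random.Random(0)
delta = conditional_token_generation("hel", rng)
print(delta, "->", kappa(delta))
```

Because acceptance is by rejection, the expected number of draws grows as the prompt's prior probability shrinks; the paper's beam-based method exists precisely to avoid this cost.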
Open Source Code: Yes
    https://github.com/genlm/genlm-bytes
Open Datasets: Yes
    "We use the wikitext-103-v1 corpus as a source of character strings; we used the version in the datasets library."
Dataset Splits: No
    "We use the wikitext-103-v1 corpus as a source of character strings; we used the version in the datasets library. Specifically, we use the test portion." While the paper states that it uses the test portion, it does not specify how the train/validation/test splits were created or whether standard splits were used (e.g., percentages, counts, or a citation to a specific split configuration).
Hardware Specification: Yes
    "Experiments were run on an L40S GPU with 40GB of memory."
Software Dependencies: No
    "We use the following publicly available models: Llama-3.2-1B, Meta-Llama-3.1-8B, DeepSeek-R1-Distill-Llama-8B, and phi-4 (14B) from the transformers library (Wolf et al., 2020). We use the vllm library (Kwon et al., 2023) backend to perform the efficient, batched evaluation of transformer language models on GPUs." No specific version numbers are provided for the transformers or vllm libraries, only the years of their corresponding papers.
Experiment Setup: Yes
    "We measure the approximation error as the average Jensen-Shannon distance (JSD) to a reference model's conditional distribution over the next byte (Fig. 1a). We use a large beam K = 128 as a reference model. We evaluate the average surprisal (negative log2 probability) of our model's estimated conditional distribution over the next byte in the corpus. As a baseline, we use the average surprisal (bits/byte) of the canonical tokenization under the token-level language model (Fig. 1b)." Error is computed between the character-level conditional distributions with beam sizes K ∈ {2, 4, 8, 16, 32, 64} and a reference distribution computed using a much larger value of K = 128.
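The two evaluation quantities in the setup above can be sketched in a few lines. This is an illustrative re-derivation, not the authors' evaluation code: `jsd` computes the Jensen-Shannon distance (square root of the base-2 JS divergence, so it lies in [0, 1]) between two next-byte distributions, and `avg_surprisal_bits` computes the average surprisal in bits per byte from the probabilities assigned to observed bytes.

```python
import math

def kl(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon distance: sqrt of the base-2 JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def avg_surprisal_bits(probs):
    """Average surprisal (-log2 p) over the probabilities of observed bytes."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# Identical distributions are at distance 0; disjoint ones at distance 1.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]
print(jsd(p, p))                        # -> 0.0
print(jsd(p, q))                        # -> 1.0
print(avg_surprisal_bits([0.5, 0.25]))  # -> 1.5
```

Averaging `jsd` over corpus positions between a small-beam and the K = 128 reference distribution gives the approximation error of Fig. 1a; `avg_surprisal_bits` over the corpus gives the bits/byte figure of Fig. 1b.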