From Language Models over Tokens to Language Models over Characters

Authors: Tim Vieira, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell

ICML 2025

Reproducibility (Variable: Result, followed by the supporting LLM response)
Research Type: Experimental
    "In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that even with a small computation budget our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved."
Researcher Affiliation: Academia
    "1 ETH Zürich, 2 Mila, 3 City University of New York, 4 McGill University, 5 Canada CIFAR AI Chair. Correspondence to: Tim Vieira <EMAIL>."
Pseudocode: Yes
    def conditional_token_generation(σ):
        while True:
            sample δ ~ p
            if κ(δ) ⪰ σ: return δ  # accept if the detokenization κ(δ) extends the prompt σ
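The pseudocode above can be sketched as runnable Python. This is a toy illustration, not the authors' implementation: `sample_p` (a unigram token model over a made-up vocabulary) and `kappa` (plain concatenation) are stand-ins for the paper's token-level language model p and detokenization map κ, and the accept condition is realized as a string-prefix check.

```python
import random

TOKENS = ["he", "hel", "lo", "world", "x"]
WEIGHTS = [0.3, 0.2, 0.2, 0.2, 0.1]

def sample_p(rng, max_len=3):
    """Toy prior over token strings: a few i.i.d. unigram draws."""
    n = rng.randint(1, max_len)
    return tuple(rng.choices(TOKENS, weights=WEIGHTS, k=n))

def kappa(delta):
    """Detokenize: map a token string to its character string."""
    return "".join(delta)

def conditional_token_generation(sigma, rng):
    """Rejection sampling: draw token strings from the prior until the
    detokenized characters extend the character prompt sigma."""
    while True:
        delta = sample_p(rng)
        if kappa(delta).startswith(sigma):  # accept
            return delta

rng = random.Random(0)
delta = conditional_token_generation("hel", rng)
print(delta, "->", kappa(delta))
```

Because acceptance is by rejection, the expected number of draws grows as the prompt's prior probability shrinks; the paper's beam-based method exists precisely to avoid this cost.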
Open Source Code: Yes
    https://github.com/genlm/genlm-bytes
Open Datasets: Yes
    "We use the wikitext-103-v1 corpus as a source of character strings; we used the version in the datasets library."
Dataset Splits: No
    "We use the wikitext-103-v1 corpus as a source of character strings; we used the version in the datasets library. Specifically, we use the test portion." While the paper states that it uses the test portion, it does not specify how the train/validation/test splits were created or whether standard splits were used (e.g., percentages, counts, or a citation to a specific split configuration).
Hardware Specification: Yes
    "Experiments were run on an L40S GPU with 40GB of memory."
Software Dependencies: No
    "We use the following publicly available models: Llama-3.2-1B, Meta-Llama-3.1-8B, DeepSeek-R1-Distill-Llama-8B, and phi-4 (14B) from the transformers library (Wolf et al., 2020). We use the vllm library (Kwon et al., 2023) backend to perform the efficient, batched evaluation of transformer language models on GPUs." No specific version numbers are provided for the transformers or vllm libraries, only the years of their corresponding papers.
Experiment Setup: Yes
    "We measure the approximation error as the average Jensen-Shannon distance (JSD) to a reference model's conditional distribution over the next byte (Fig. 1a). We use a large beam K = 128 as a reference model. We evaluate the average surprisal (negative log2 probability) of our model's estimated conditional distribution over the next byte in the corpus. As a baseline, we use the average surprisal (bits/byte) of the canonical tokenization under the token-level language model (Fig. 1b)." Error is computed between the character-level conditional distributions with beam sizes K ∈ {2, 4, 8, 16, 32, 64} and a reference distribution computed using a much larger value of K = 128.
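The two evaluation quantities in the setup above can be sketched in a few lines. This is an illustrative re-derivation, not the authors' evaluation code: `jsd` computes the Jensen-Shannon distance (square root of the base-2 JS divergence, so it lies in [0, 1]) between two next-byte distributions, and `avg_surprisal_bits` computes the average surprisal in bits per byte from the probabilities assigned to observed bytes.

```python
import math

def kl(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon distance: sqrt of the base-2 JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def avg_surprisal_bits(probs):
    """Average surprisal (-log2 p) over the probabilities of observed bytes."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# Identical distributions are at distance 0; disjoint ones at distance 1.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]
print(jsd(p, p))                        # -> 0.0
print(jsd(p, q))                        # -> 1.0
print(avg_surprisal_bits([0.5, 0.25]))  # -> 1.5
```

Averaging `jsd` over corpus positions between a small-beam and the K = 128 reference distribution gives the approximation error of Fig. 1a; `avg_surprisal_bits` over the corpus gives the bits/byte figure of Fig. 1b.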