MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Authors: Julie Kallini, Shikhar Murty, Christopher Manning, Christopher Potts, Róbert Csordás
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that MrT5 achieves lower bits-per-byte than both random and fixed token deletion baselines, as well as pooling-based alternatives, at the same compression rates. With multilingual training, MrT5 adjusts to each language's orthographic features, learning optimal compression rates specific to each language. Finally, in multilingual and character-level benchmarks (Section 6), MrT5 achieves comparable accuracy to ByT5 while cutting the sequence length by up to 75%, significantly improving inference runtimes. |
| Researcher Affiliation | Academia | Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás. Stanford University. |
| Pseudocode | No | The paper describes methods and formulas using mathematical notation (e.g., Equation 1 in Section 3.1 for the delete gate, or equations in Appendix B for Gumbel-Sigmoid) but does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT Steps for reproducing each of our experiments are detailed in Appendix E. Descriptions of the model architectures and training configurations/hyperparameters for our diagnostic task experiments are provided in Appendix E.1; details of the model architectures, span corruption data preprocessing steps, and training configurations/hyperparameters for continued pre-training are provided in Appendix E.2; and training configurations/hyperparameters for fine-tuning on the multilingual and character-level downstream tasks are provided in Appendix E.3. We provide our source code at https://github.com/jkallini/mrt5. |
| Open Datasets | Yes | For continued pre-training, we use the multilingual C4 (mC4) corpus (Raffel et al., 2020; Xue et al., 2021). We first test the cross-lingual capabilities of MrT5 using the Cross-lingual Natural Language Inference (XNLI) corpus (Conneau et al., 2018). Our second cross-lingual evaluation employs the TyDi QA Gold Passage Task (TyDiQA-GoldP, Clark et al., 2020). We fine-tune and evaluate MrT5 and baseline models on the Spelling Correction with Context and Word Search character-level tasks from Huang et al. (2023). |
| Dataset Splits | Yes | For evaluation, we sample each language's test set from its mC4 validation split. Each language is tested on 10,000 examples, except for Swahili and Urdu, which only have 2,800 and 9,300 examples in their validation splits, respectively. We sample a disjoint sample of 16,000 examples of English C4 to use as a validation set during training. For TyDiQA-GoldP, the multilingual training set is split into 80% for training and 20% for validation, and evaluation is performed on the separate TyDiQA-GoldP test set. For the character-level tasks, we use the provided train/validation sets for training and evaluate on each task's respective test split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, memory specifications, or cloud instance types. It mentions model sizes and computational savings but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify version numbers for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software components. |
| Experiment Setup | Yes | We use a batch size of 128 examples and a sequence length of 64 tokens, and we train each model for a total of 30,000 gradient steps. We use the AdamW optimizer with a learning rate that linearly warms up to 1e-4 over 3,000 steps and linearly decays. Table 6: Fine-tuning and evaluation details for the XNLI, TyDiQA-GoldP, Spelling Correction, and Word Search downstream tasks. The maximum sequence length shown is for the encoder. LR denotes the initial learning rate used during fine-tuning. This table provides specific values for Steps, Epochs, Batch Size, Max Seq. Length, LR, and PI Controller parameters for each task. |
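As the Pseudocode row notes, the paper gives the delete gate and Gumbel-Sigmoid only as equations (Equation 1 in Section 3.1; Appendix B), not as algorithm blocks. For orientation, here is a minimal sketch of the *standard* Gumbel-Sigmoid relaxation of a Bernoulli gate; the exact parameterization in the paper (e.g., how gate logits are computed from hidden states, the temperature used) may differ, so treat this only as an illustration of the general technique:

```python
import math
import random

def gumbel_noise(eps: float = 1e-9) -> float:
    """Sample from Gumbel(0, 1) via the inverse-CDF transform."""
    u = random.random()
    return -math.log(-math.log(u + eps) + eps)

def gumbel_sigmoid(logit: float, tau: float = 1.0) -> float:
    """Stochastic relaxation of a hard 0/1 gate.

    Returns sigma((logit + g1 - g2) / tau) with g1, g2 ~ Gumbel(0, 1),
    which is differentiable in `logit` yet concentrates on {0, 1}
    as the temperature `tau` decreases.
    """
    noisy = logit + gumbel_noise() - gumbel_noise()
    return 1.0 / (1.0 + math.exp(-noisy / tau))
```

In a delete-gate setting, each token's gate logit would be produced by a learned projection of its hidden state, and tokens whose gate value falls below a threshold are dropped at inference time; those details are assumptions here, not a transcription of the paper's method.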
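The continued pre-training schedule quoted in the Experiment Setup row (linear warmup to 1e-4 over 3,000 of 30,000 total steps, then linear decay) can be sketched as a per-step learning-rate function. Whether the decay reaches exactly zero at the final step is an assumption; the paper only says the rate "linearly decays":

```python
def lr_at_step(step: int,
               peak_lr: float = 1e-4,
               warmup_steps: int = 3_000,
               total_steps: int = 30_000) -> float:
    """Linear warmup to peak_lr, then linear decay (assumed to end at 0)."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp from peak_lr back down over the remaining steps.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

This matches the common linear-schedule-with-warmup recipe (e.g., as offered by deep-learning libraries' scheduler utilities), which is a plausible but unconfirmed reading of the paper's description.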