Scaling Laws for Downstream Task Performance in Machine Translation
Authors: Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data. |
| Researcher Affiliation | Collaboration | Google Research, OpenAI, Stanford University |
| Pseudocode | No | The paper describes methods and equations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The reproducibility statement says: "We used publicly available datasets and models, and specified their versions with proper citations in Section 4 and Appendix A. We provided details on the training procedure and hyperparameters for both pretraining and finetuning stages." This statement confirms the use of public datasets and models but does not explicitly state that the authors' own implementation code for the methodology described in the paper is available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use the English (en), German (de), French (fr), and Romanian (ro) portions of the MC4 dataset (Raffel et al., 2020). We finetune the pretrained models on WMT-17 en-de (Bojar et al., 2017), WMT-15 en-fr (Bojar et al., 2014), and WMT-16 en-ro (Bojar et al., 2016), separately. In Appendix B, we provide additional experimental results to demonstrate that the proposed scaling law is applicable to tasks beyond translation as well. For this, we analyze models pretrained on en-MC4 and finetuned on SuperGLUE (Wang et al., 2019)... |
| Dataset Splits | No | To understand the effect of the finetuning data size on scaling, we sometimes use a smaller randomly sampled portion from these translation datasets and indicate the number of tokens used in the plots. The paper specifies total token counts for fine-tuning datasets (e.g., WMT-17 en-de with 3B tokens) and mentions sampling smaller portions, but it does not provide specific train/validation/test splits (e.g., percentages or sample counts) for these datasets. |
| Hardware Specification | Yes | For the T5-3B experiments, pretraining for 1M steps takes 15-20 hours and finetuning takes 5-7 hours on an 8x8 TPU. |
| Software Dependencies | No | We use the 3-billion encoder-decoder T5 model... For encoding the text as WordPiece tokens (Sennrich et al., 2016; Kudo, 2018), we use SentencePiece (Kudo & Richardson, 2018) trained with a vocabulary of size 250,112 that covers all the languages in the MC4 dataset (Raffel et al., 2020). Following Raffel et al. (2020), we use an inverse square root learning rate schedule... In both stages, we use the AdaFactor optimizer (Shazeer & Stern, 2018). The paper mentions several tools and models but does not provide specific version numbers for the software packages used (e.g., TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | During pretraining, we use a batch size of 256 and a sequence length of 512 for 1,000,000 steps, except for the ro-MC4 pretraining. For ro-MC4, we pretrain for 510,000 steps since otherwise we would need to repeat sequences. Following Raffel et al. (2020), we use an inverse square root learning rate schedule, 1/√(max(n, k)), where n is the current pretraining step and k = 10^4. We do a grid search for the base learning rate from {0.05, 0.1, 0.5, 1.0, 2.0, 5.0} and pick the best one for each pretrained model based on upstream cross-entropy. We perform full-weight finetuning. During finetuning, again following Raffel et al. (2020), we use a batch size of 128 and a sequence length of 512 for 300 steps. We use a constant learning rate, selecting the best from {0.001, 0.005, 0.01, 0.05, 0.1}. In both stages, we use the AdaFactor optimizer (Shazeer & Stern, 2018). |
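The pretraining schedule described in the setup row can be sketched as follows. This is a minimal illustration of the inverse square root rule 1/√(max(n, k)) with k = 10^4 and the base-learning-rate grid quoted above; the function name and structure are assumptions, not the authors' (unreleased) code.

```python
import math

def inverse_sqrt_lr(step: int, base_lr: float, k: int = 10_000) -> float:
    """lr = base_lr / sqrt(max(step, k)): constant for the first k steps,
    then decaying as 1/sqrt(step)."""
    return base_lr / math.sqrt(max(step, k))

# Base learning rates searched during pretraining (from the paper);
# the best one is picked per pretrained model by upstream cross-entropy.
PRETRAIN_BASE_LRS = [0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
```

For example, with base_lr = 1.0 the rate stays at 0.01 through step 10,000 and decays to 0.001 by step 1,000,000, matching the paper's 1M-step pretraining horizon.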