Scaling Laws for Downstream Task Performance in Machine Translation
Authors: Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data. |
| Researcher Affiliation | Collaboration | Google Research, OpenAI, Stanford University |
| Pseudocode | No | The paper describes methods and equations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The reproducibility statement says: "We used publicly available datasets and models, and specified their versions with proper citations in Section 4 and Appendix A. We provided details on the training procedure and hyperparameters for both pretraining and finetuning stages." This statement confirms the use of public datasets and models but does not explicitly state that the authors' own implementation code for the methodology described in the paper is available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use the English (en), German (de), French (fr), and Romanian (ro) portions of the MC4 dataset (Raffel et al., 2020). We finetune the pretrained models on WMT-17 en-de (Bojar et al., 2017), WMT-15 en-fr (Bojar et al., 2014), and WMT-16 en-ro (Bojar et al., 2016), separately. In Appendix B, we provide additional experimental results to demonstrate that the proposed scaling law is applicable to tasks beyond translation as well. For this, we analyze models pretrained on en-MC4 and finetuned on SuperGLUE (Wang et al., 2019)... |
| Dataset Splits | No | To understand the effect of the finetuning data size on scaling, we sometimes use a smaller randomly sampled portion from these translation datasets and indicate the number of tokens used in the plots. The paper specifies total token counts for fine-tuning datasets (e.g., WMT-17 en-de with 3B tokens) and mentions sampling smaller portions, but it does not provide specific train/validation/test splits (e.g., percentages or sample counts) for these datasets. |
| Hardware Specification | Yes | For the T5-3B experiments, pretraining for 1M steps takes 15-20 hours and finetuning takes 5-7 hours on an 8x8 TPU. |
| Software Dependencies | No | We use the 3-billion encoder-decoder T5 model... For encoding the text as WordPiece tokens (Sennrich et al., 2016; Kudo, 2018), we use SentencePiece (Kudo & Richardson, 2018) trained with a vocabulary of size 250,112 that covers all the languages in the MC4 dataset (Raffel et al., 2020). Following Raffel et al. (2020), we use an inverse square root learning rate schedule... In both stages, we use the AdaFactor optimizer (Shazeer & Stern, 2018). The paper mentions several tools and models but does not provide specific version numbers for the software packages used (e.g., TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | During pretraining, we use a batch size of 256 and a sequence length of 512 for 1,000,000 steps, except for the ro-MC4 pretraining. For ro-MC4, we pretrain for 510,000 steps since otherwise we would need to repeat sequences. Following Raffel et al. (2020), we use an inverse square root learning rate schedule, 1/√(max(n, k)), where n is the current pretraining step and k = 10^4. We do a grid search for the base learning rate from {0.05, 0.1, 0.5, 1.0, 2.0, 5.0} and pick the best one for each pretrained model based on upstream cross-entropy. We perform full-weight finetuning. During finetuning, again following Raffel et al. (2020), we use a batch size of 128 and a sequence length of 512 for 300 steps. We use a constant learning rate, selecting the best from {0.001, 0.005, 0.01, 0.05, 0.1}. In both stages, we use the AdaFactor optimizer (Shazeer & Stern, 2018). |
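The pretraining schedule described in the setup row can be sketched as follows. This is a minimal illustration of the inverse square root rule 1/√(max(n, k)) with k = 10^4 and the base-learning-rate grid quoted above; the function name and structure are assumptions, not the authors' (unreleased) code.

```python
import math

def inverse_sqrt_lr(step: int, base_lr: float, k: int = 10_000) -> float:
    """lr = base_lr / sqrt(max(step, k)): constant for the first k steps,
    then decaying as 1/sqrt(step)."""
    return base_lr / math.sqrt(max(step, k))

# Base learning rates searched during pretraining (from the paper);
# the best one is picked per pretrained model by upstream cross-entropy.
PRETRAIN_BASE_LRS = [0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
```

For example, with base_lr = 1.0 the rate stays at 0.01 through step 10,000 and decays to 0.001 by step 1,000,000, matching the paper's 1M-step pretraining horizon.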