Scaling Laws for Predicting Downstream Performance in LLMs
Authors: Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. Further, we present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpus with code data to accurately represent the common necessity. Through comprehensive ablation studies, we validate the design choices in our analytical functions. |
| Researcher Affiliation | Collaboration | Yangyi Chen (1,2), Binxuan Huang (2), Yifan Gao (2), Zhengyang Wang (2), Jingfeng Yang (2), Heng Ji (2). 1: University of Illinois Urbana-Champaign, 2: Amazon. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methods FLP and FLP-M in detail with mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a third-party tool: "We adopt lm-evaluation-harness (Gao et al., 2023b) for unified evaluation." However, it does not provide any explicit statement about releasing the source code for the methodology described in this paper (FLP or FLP-M), nor does it provide a link to a code repository. |
| Open Datasets | Yes | Pre-Training We use the Red Pajama v1 (Computer, 2023), which consists of 1.2T tokens in total... For general corpus, we use DCLM (Li et al., 2024b), a curated high-quality pre-training corpus... For code data, we use The Stack v2 (Lozhkov et al., 2024), which initially contains over 3B files in 600+ programming and markup languages, created as part of the Big Code project. |
| Dataset Splits | No | The paper mentions curating a "validation dataset" and using specific benchmarks for "Evaluation." However, it does not provide specific training/test/validation splits (e.g., percentages or exact counts) for its main pre-training datasets (Red Pajama, DCLM, The Stack v2) or how these were used to derive the training data for the sampling LMs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing configurations used for running the experiments. It only discusses model sizes and FLOPs. |
| Software Dependencies | No | The paper mentions using "lm-evaluation-harness (Gao et al., 2023b)" and the "Adam optimizer (Diederik, 2014)" but does not specify version numbers for these or any other key software components, such as programming languages, deep learning frameworks, or operating systems. |
| Experiment Setup | Yes | Table 1: The configurations of the sampling and target LMs with various sizes. HD denotes the hidden dimension, BS denotes the batch size, and LR denotes the learning rate. ... The network is optimized using the regression loss with L2 regularization and the Adam optimizer (Diederik, 2014), employing a learning rate of 0.05 that linearly decays to 0 within 2,000 steps and a weight decay of 0.01. |
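The FLP pipeline the table refers to maps compute (FLOPs) to pre-training loss, and loss to downstream performance. The sketch below illustrates that two-stage idea only; it is not the authors' implementation, and all compute budgets, losses, and coefficients are invented for illustration.

```python
# Illustrative sketch of the FLP (FLOPs -> Loss -> Performance) idea.
# NOT the paper's code: all numbers and the loss->accuracy map are hypothetical.
import math

def fit_power_law(flops, losses):
    """Fit L(C) = a * C**(-b) by least squares in log-log space."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope  # a, b

# Stage 1: FLOPs -> pre-training loss, fitted on small "sampling" LMs.
sample_flops  = [1e19, 3e19, 1e20, 3e20]   # hypothetical compute budgets
sample_losses = [3.20, 2.95, 2.72, 2.51]   # hypothetical observed losses
a, b = fit_power_law(sample_flops, sample_losses)

# Extrapolate to a larger target model's compute budget.
target_flops = 1e21
predicted_loss = a * target_flops ** (-b)

# Stage 2: loss -> downstream performance. A simple clipped linear map stands
# in for the paper's learned analytical function (coefficients made up).
def loss_to_accuracy(loss, k=-0.35, c=1.4):
    return max(0.0, min(1.0, k * loss + c))

print(f"predicted loss: {predicted_loss:.3f}")
print(f"predicted accuracy: {loss_to_accuracy(predicted_loss):.3f}")
```

Fitting in log-log space turns the power law into a line, so ordinary least squares suffices; the paper's actual analytical functions and fitting procedure differ in detail.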