Scaling Laws for Predicting Downstream Performance in LLMs
Authors: Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. Further, we present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpus with code data to accurately represent the common necessity. Through comprehensive ablation studies, we validate the design choices in our analytical functions. |
| Researcher Affiliation | Collaboration | Yangyi Chen (1,2), Binxuan Huang (2), Yifan Gao (2), Zhengyang Wang (2), Jingfeng Yang (2), Heng Ji (2). 1: University of Illinois Urbana-Champaign, 2: Amazon. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methods FLP and FLP-M in detail with mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a third-party tool: "We adopt lm-evaluation-harness (Gao et al., 2023b) for unified evaluation." However, it does not provide any explicit statement about releasing the source code for the methodology described in this paper (FLP or FLP-M), nor does it provide a link to a code repository. |
| Open Datasets | Yes | Pre-Training We use the Red Pajama v1 (Computer, 2023), which consists of 1.2T tokens in total... For general corpus, we use DCLM (Li et al., 2024b), a curated high-quality pre-training corpus... For code data, we use The Stack v2 (Lozhkov et al., 2024), which initially contains over 3B files in 600+ programming and markup languages, created as part of the Big Code project. |
| Dataset Splits | No | The paper mentions curating a "validation dataset" and using specific benchmarks for "Evaluation." However, it does not provide specific training/test/validation splits (e.g., percentages or exact counts) for its main pre-training datasets (Red Pajama, DCLM, The Stack v2) or how these were used to derive the training data for the sampling LMs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing configurations used for running the experiments. It only discusses model sizes and FLOPs. |
| Software Dependencies | No | The paper mentions using "lm-evaluation-harness (Gao et al., 2023b)" and the "Adam optimizer (Diederik, 2014)" but does not specify version numbers for these or any other key software components, such as programming languages, deep learning frameworks, or operating systems. |
| Experiment Setup | Yes | Table 1: The configurations of the sampling and target LMs with various sizes. HD denotes the hidden dimension, BS denotes the batch size, and LR denotes the learning rate. ... The network is optimized using the regression loss with L2 regularization and the Adam optimizer (Diederik, 2014), employing a learning rate of 0.05 that linearly decays to 0 within 2,000 steps and a weight decay of 0.01. |
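The FLP pipeline the table refers to maps compute (FLOPs) to pre-training loss, and loss to downstream performance. The sketch below illustrates that two-stage idea only; it is not the authors' implementation, and all compute budgets, losses, and coefficients are invented for illustration.

```python
# Illustrative sketch of the FLP (FLOPs -> Loss -> Performance) idea.
# NOT the paper's code: all numbers and the loss->accuracy map are hypothetical.
import math

def fit_power_law(flops, losses):
    """Fit L(C) = a * C**(-b) by least squares in log-log space."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope  # a, b

# Stage 1: FLOPs -> pre-training loss, fitted on small "sampling" LMs.
sample_flops  = [1e19, 3e19, 1e20, 3e20]   # hypothetical compute budgets
sample_losses = [3.20, 2.95, 2.72, 2.51]   # hypothetical observed losses
a, b = fit_power_law(sample_flops, sample_losses)

# Extrapolate to a larger target model's compute budget.
target_flops = 1e21
predicted_loss = a * target_flops ** (-b)

# Stage 2: loss -> downstream performance. A simple clipped linear map stands
# in for the paper's learned analytical function (coefficients made up).
def loss_to_accuracy(loss, k=-0.35, c=1.4):
    return max(0.0, min(1.0, k * loss + c))

print(f"predicted loss: {predicted_loss:.3f}")
print(f"predicted accuracy: {loss_to_accuracy(predicted_loss):.3f}")
```

Fitting in log-log space turns the power law into a line, so ordinary least squares suffices; the paper's actual analytical functions and fitting procedure differ in detail.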