Scaling Inference-Efficient Language Models

Authors: Song Bian, Minghao Yan, Shivaram Venkataraman

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models.
Researcher Affiliation | Academia | Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA.
Pseudocode | No | The paper describes its methodology in Section 2.3 and Figure 6, but presents it as descriptive text and a flowchart rather than in a structured pseudocode or algorithm block.
Open Source Code | Yes | The training code is available at https://github.com/Waterpine/open-lm-morph. The Morph-1B model checkpoint is available at https://huggingface.co/NaiveUser/morph-1b.
Open Datasets | Yes | The models are trained on uniformly sampled subsets of DCLM-Baseline (Li et al., 2024). Downstream task accuracy of models derived from the methodology outlined in Section 2.3 is evaluated on ARC-Easy (Clark et al., 2018), ARC-Challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), COPA (Roemmele et al., 2011), HellaSwag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), MMLU (Hendrycks et al., 2020), Jeopardy (Jeo, 2022), and Winograd (Levesque et al., 2012).
Dataset Splits | Yes | LLM-foundry (llm, 2024) is used with a zero-shot evaluation approach to measure model performance on downstream tasks, over the same datasets listed under Open Datasets.
Hardware Specification | Yes | Inference latency is collected using the Hugging Face generate function on a single NVIDIA Ampere 40GB A100 GPU; further evaluations use the same setup on an A100 40GB GPU and on a single NVIDIA A30 Tensor Core GPU. The inference efficiency of open-source large language models is first evaluated over vLLM using an NVIDIA A100 Ampere 40GB GPU.
Software Dependencies | No | The paper mentions using Hugging Face (Wolf et al., 2019), LLM-foundry (llm, 2024), the GPT-NeoX tokenizer (Black et al., 2022), and the AdamW optimizer with bfloat16 precision, but does not provide specific version numbers for these software components or other libraries.
Experiment Setup | Yes | Table 3 (Hyperparameters) lists the hyperparameters used for training; the batch size is the global batch size and the default sequence length is 2048. All models are trained in bfloat16 precision using the AdamW optimizer.
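For readers unfamiliar with the optimizer named in the training setup, the update rule of AdamW (Adam with decoupled weight decay) can be sketched for a single scalar parameter as below. This is a minimal illustration, not the paper's training code; the hyperparameter values shown are illustrative defaults, not the values from the paper's Table 3.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One scalar AdamW update; returns (new_param, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA of the gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction (t is 1-indexed)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly rather than
    # folded into the gradient -- the distinguishing feature of AdamW.
    param = param - lr * weight_decay * param
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

In a real training loop this update is applied elementwise to every parameter tensor, with the moment estimates `m` and `v` stored per parameter; frameworks such as PyTorch provide this as a built-in optimizer.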