Scaling Inference-Efficient Language Models

Authors: Song Bian, Minghao Yan, Shivaram Venkataraman

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models.
Researcher Affiliation | Academia | Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA.
Pseudocode | No | The paper describes its methodology in Section 2.3 and Figure 6, but presents it as descriptive text and a flowchart rather than in a structured pseudocode or algorithm block.
Open Source Code | Yes | The training code is available at https://github.com/Waterpine/open-lm-morph. The Morph-1B model checkpoint is available at https://huggingface.co/NaiveUser/morph-1b.
Open Datasets | Yes | The models are trained on uniformly sampled subsets of DCLM-Baseline (Li et al., 2024). Downstream task accuracy of models derived from the methodology outlined in Section 2.3 is evaluated on ARC-Easy (Clark et al., 2018), ARC-Challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), COPA (Roemmele et al., 2011), HellaSwag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), MMLU (Hendrycks et al., 2020), Jeopardy (Jeo, 2022), and Winograd (Levesque et al., 2012).
Dataset Splits | Yes | LLM-foundry (llm, 2024) is used with a zero-shot evaluation approach to measure model performance on downstream tasks, over the same datasets listed under Open Datasets.
Hardware Specification | Yes | Inference latency is collected using the Hugging Face generate function on a single NVIDIA Ampere 40GB A100 GPU; further evaluations use the same setup on an A100 40GB GPU and on a single NVIDIA A30 Tensor Core GPU. The inference efficiency of open-source large language models is first evaluated over vLLM using an NVIDIA A100 Ampere 40GB GPU.
Software Dependencies | No | The paper mentions using Hugging Face (Wolf et al., 2019), LLM-foundry (llm, 2024), the GPT-NeoX tokenizer (Black et al., 2022), and the AdamW optimizer with bfloat16 precision, but does not provide specific version numbers for these software components or other libraries.
Experiment Setup | Yes | Table 3 (Hyperparameters) lists the hyperparameters used for training; the batch size is the global batch size and the default sequence length is 2048. All models are trained in bfloat16 precision using the AdamW optimizer.
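For readers unfamiliar with the optimizer named in the training setup, the update rule of AdamW (Adam with decoupled weight decay) can be sketched for a single scalar parameter as below. This is a minimal illustration, not the paper's training code; the hyperparameter values shown are illustrative defaults, not the values from the paper's Table 3.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One scalar AdamW update; returns (new_param, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA of the gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction (t is 1-indexed)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly rather than
    # folded into the gradient -- the distinguishing feature of AdamW.
    param = param - lr * weight_decay * param
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

In a real training loop this update is applied elementwise to every parameter tensor, with the moment estimates `m` and `v` stored per parameter; frameworks such as PyTorch provide this as a built-in optimizer.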