Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Authors: Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammed Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Dov Karpas, Roi Koren, Itay Levy, Pavlo Molchanov, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models derived from Llama-70B-Instruct. Both models achieve a 2.17× inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracies. Our work establishes that powerful LLM models can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
Main Results: Using our Puzzle framework, we generated Nemotron-51B as a child derivative of the Llama-70B model. Nemotron-51B achieves a significant improvement in inference efficiency while retaining nearly all the accuracy of its parent, demonstrating the effectiveness of our approach.
Evaluating Model Performance: To evaluate Puzzle-derived child models like Nemotron-51B, two performance metrics are of primary interest: (1) Accuracy Preservation: how much of the parent model's accuracy is retained by the child. (2) Computational Efficiency: how well the child model adheres to the constraints it was optimized for; in our case, the focus is on throughput, showing how we improved the model's suitability for reducing inference cost.
Accuracy comparison: Table 1 compares the accuracy of Nemotron-51B with its parent across several benchmarks. Throughput comparison: Table 2 specifies the throughput performance of Nemotron-51B against its parent across diverse input-output sequence lengths.
Accuracy vs. throughput frontier: The tradeoff between accuracy and efficiency is key for model selection, impacting deployment costs. Nemotron-51B is designed to balance the two and push beyond the current efficient frontier. Because throughput directly affects cost, Nemotron-51B provides the best accuracy per dollar, as shown in Figure 5.
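The two evaluation metrics above can be made concrete with a small sketch. Averaging per-benchmark child/parent score ratios is an assumption on our part; the paper reports an aggregate figure (98.4%) without spelling out the exact aggregation rule, and all numbers below are illustrative:

```python
def accuracy_preservation(child_scores, parent_scores):
    """Mean child/parent score ratio across benchmarks, as a percentage.

    Per-benchmark ratio averaging is an assumption; the paper does not
    state how the 98.4% aggregate is computed.
    """
    ratios = [c / p for c, p in zip(child_scores, parent_scores)]
    return 100.0 * sum(ratios) / len(ratios)

def throughput_speedup(child_tok_per_s, parent_tok_per_s):
    """Child-over-parent throughput ratio (tokens/sec per GPU)."""
    return child_tok_per_s / parent_tok_per_s

# Illustrative numbers only, not the paper's measurements:
print(round(accuracy_preservation([78.0, 62.0], [80.0, 63.0]), 2))  # ~97.96
print(round(throughput_speedup(651.0, 300.0), 2))                   # 2.17
```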
Ablation Studies Highlights: We performed a series of ablations to examine how local distillation, block search strategies, data composition, and search space design each contribute to Puzzle's accuracy-efficiency tradeoffs. See Appendix F for full details.
Researcher Affiliation Industry Correspondence to: Akhiad Bercovich, Tomer Ronen, Ran El-Yaniv <EMAIL>.
Pseudocode No The paper describes the 'Puzzle' framework in three stages (Crafting the puzzle pieces, Assembling the puzzle architecture, Uptraining) and details the search algorithm using mixed-integer programming (MIP), but it does not present these as a formal pseudocode block or algorithm figure. The MIP problem is formulated mathematically in Appendix B, but as equations rather than pseudocode.
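To make the described search stage concrete, here is a toy sketch of the block-selection objective: pick one block variant per layer to maximize total quality subject to a runtime budget. All scores and costs below are invented, and the exhaustive search is a tiny stand-in for the MIP solve the paper performs at scale with the python-mip package:

```python
from itertools import product

# Hypothetical per-layer block variants as (quality_score, runtime_cost).
# Puzzle scores many alternative blocks per layer and selects among them
# with mixed-integer programming; this brute-force loop only illustrates
# the objective and constraint on a 3-layer toy problem.
LAYER_VARIANTS = [
    [(1.00, 4.0), (0.97, 2.5), (0.90, 1.0)],  # layer 0: full / pruned / cheap
    [(1.00, 4.0), (0.95, 2.0), (0.85, 1.0)],  # layer 1
    [(1.00, 4.0), (0.99, 3.0), (0.92, 1.5)],  # layer 2
]

def assemble(budget):
    """Choose one variant per layer maximizing quality s.t. cost <= budget."""
    best_quality, best_choice = -1.0, None
    for choice in product(*(range(len(v)) for v in LAYER_VARIANTS)):
        quality = sum(LAYER_VARIANTS[i][c][0] for i, c in enumerate(choice))
        cost = sum(LAYER_VARIANTS[i][c][1] for i, c in enumerate(choice))
        if cost <= budget and quality > best_quality:
            best_quality, best_choice = quality, choice
    return best_choice, best_quality

print(assemble(12.0))  # ample budget: all full blocks, (0, 0, 0)
print(assemble(8.0))   # tight budget forces cheaper variants
```

In the real formulation each variant choice is a binary decision variable, the quality/cost sums become linear expressions over those variables, and a one-variant-per-layer constraint replaces the Cartesian product.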
Open Source Code Yes Our contributions: (1) ... A demo of Puzzle is available at: https://github.com/NVlabs/puzzle/
Open Datasets Yes To ensure broad coverage of diverse data domains within limited training schedules, we curated a dataset mixture, termed Distillation Mix, for all our distillation training runs. This mixture includes source code repositories, Wikipedia articles, books, news websites, and several other domains. The dataset comprises 224 billion tokens collected from three public datasets: FineWeb (Penedo et al., 2024), Dolma (Soldaini et al., 2024), and Buzz-V1.2 (Hive-Digital Technologies). ... We evaluate the robustness of the Puzzle framework to training data composition by comparing performances up through the BLD stage (prior to GKD). For this analysis, we contrast two datasets: our domain-diverse Distillation Mix (described in Section 3) and the English subset of Project Gutenberg (Project Gutenberg).
Dataset Splits No The paper mentions using a 'validation corpus' for LM loss and KL divergence, and for Half-MMLU it splits its 57 tasks into two nearly equal-sized sets, one for training and one for evaluation. It also states: 'In our BLD experiments, we used 1 billion training tokens.' For Nemotron-49B-Base, it mentions 'uptrained on an additional 5B tokens at 64K context and 5B at 128K'. However, it does not provide specific train/test/validation split percentages or exact sample counts for the primary datasets used in the main experiments or general workflow, nor does it refer to standard predefined splits with citations for these main datasets.
Hardware Specification Yes Both models achieve a 2.17× inference throughput speedup, fitting on a single NVIDIA H100 GPU ... Throughput is measured in tokens per second per GPU (NVIDIA H100). TP# indicates the number of GPUs used in tensor parallelism. Note: Results were obtained on NVIDIA H100 SXM GPUs with FP8 quantization for weights, activations and KV cache using TensorRT-LLM. ... optimized for throughput specifically on an RTX 4090 GPU.
Software Dependencies No The paper mentions 'TensorRT-LLM' as a highly optimized LLM runtime and the use of the 'open-source python-mip package (Inc., 2023)'. However, explicit version numbers for TensorRT-LLM or the python-mip package are not provided in the main text.
Experiment Setup Yes The inference engine handled this selection dynamically for each run. For example, Nemotron-51B achieved optimal throughput with TP=1 and batch size 256, while Llama-3.1-70B performed best with TP=4 and batch size 384. ... FP8 quantization for weights, activations and KV cache using TensorRT-LLM. ... Our experiments, summarized in Table 6, demonstrate that notable accuracy recovery can be realized with reduced token usage. After only 3.7B tokens of GKD, Nemotron-51B recovered 98.8% of its parent's accuracy on MMLU and MT-Bench benchmarks. Similarly, Nemotron-49B regained 99.63% of its parent's accuracy after only 8.68B tokens, and even 98.47% after just 2.9B tokens.
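The per-model selection of tensor parallelism (TP) and batch size described above amounts to maximizing tokens per second per GPU over measured configurations. The sketch below shows that selection rule; the throughput numbers are invented, while the winning configurations (TP=1/batch 256 for Nemotron-51B, TP=4/batch 384 for Llama-3.1-70B) come from the paper:

```python
def best_config(measurements):
    """Pick the (tp, batch_size) config maximizing tokens/sec *per GPU*.

    `measurements` maps (tp, batch_size) -> total tokens/sec across all
    TP ranks; dividing by tp normalizes to per-GPU throughput.
    """
    return max(measurements, key=lambda cfg: measurements[cfg] / cfg[0])

# Invented throughput measurements for illustration only:
nemotron_51b = {(1, 256): 6000.0, (2, 256): 9000.0, (4, 384): 14000.0}
print(best_config(nemotron_51b))  # (1, 256): 6000/GPU beats 4500 and 3500
```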