Progressive Mixed-Precision Decoding for Efficient LLM Inference

Authors: Hao (Mark) Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos Venieris

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluation across diverse language tasks shows that when targeting Nvidia GPUs, PMPD achieves a 1.4-12.2x speedup in LLM linear layers over fp16 models and up to 1.41x over uniform quantization. When targeting an LLM-optimized NPU, our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches while preserving the output quality. Our code is available at github.com/SamsungLabs/PMPD. ... Section 5 EVALUATION. Models and Datasets. We conducted experiments on edge-deployable models, including Vicuna-7B (Chiang et al., 2023), MobileLLaMA-1.4B (Chu et al., 2023), StableLM-Zephyr-3B, and Phi-1.5 (Li et al., 2023b), evaluating their zero-shot generative performance on news summarization, dialogue summarization, and translation tasks using the CNN/DM (Hermann et al., 2015), DialogSum (Chen et al., 2021), and IWSLT French-English datasets (Cettolo et al., 2017), respectively.
Researcher Affiliation | Collaboration | Hao (Mark) Chen (1,2), Fuwen Tan (1), Alexandros Kouris (1), Royson Lee (1), Hongxiang Fan (1,2), Stylianos I. Venieris (1). (1) Samsung AI Center, Cambridge, UK; (2) Imperial College London, UK. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 presents our end-to-end methodology, consisting of: i) the offline stage (lines 1-8) and ii) deployment (lines 9-22). Algorithm 1: Progressive Mixed-Precision Decoding
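The core idea the abstract and Algorithm 1 describe, decoding early tokens at higher precision and later tokens at lower precision, can be sketched as below. This is a minimal illustration, not the authors' implementation: the `model_hi`/`model_lo` callables and the fixed `switch_step` stand in for quantized model variants and the paper's precision scheduler.

```python
def pmpd_decode(prompt_ids, model_hi, model_lo, switch_step,
                max_new_tokens, eos_id=None):
    """Greedy decoding that uses a high-precision model for the first
    `switch_step` generated tokens, then drops to a lower-precision one.

    `model_hi` / `model_lo` are placeholders for a forward pass + argmax
    over the token sequence; the switch point would come from a scheduler.
    """
    tokens = list(prompt_ids)
    for step in range(max_new_tokens):
        # Progressive precision lowering: precision only ever decreases
        # as decoding proceeds.
        model = model_hi if step < switch_step else model_lo
        next_id = model(tokens)
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens
```

With dummy "models" that always emit a fixed token, the switch point is easy to observe: the first `switch_step` generated tokens come from the high-precision callable, the rest from the low-precision one.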
Open Source Code | Yes | Our code is available at github.com/SamsungLabs/PMPD.
Open Datasets | Yes | We conducted experiments on edge-deployable models, including Vicuna-7B (Chiang et al., 2023), MobileLLaMA-1.4B (Chu et al., 2023), StableLM-Zephyr-3B, and Phi-1.5 (Li et al., 2023b), evaluating their zero-shot generative performance on news summarization, dialogue summarization, and translation tasks using the CNN/DM (Hermann et al., 2015), DialogSum (Chen et al., 2021), and IWSLT French-English datasets (Cettolo et al., 2017), respectively. We also tested open-ended question answering on MT-Bench (Chiang et al., 2023)... The Learned scheduler was trained using the first 256 samples from the C4 test dataset as the seed dataset.
Dataset Splits | No | The paper mentions using a 'validation set' for static scheduler optimization and 'the first 256 samples from the C4 test dataset as the seed dataset' for the learned scheduler. However, it does not provide explicit percentages or exact sample counts for the training/validation/test splits of the datasets used in the main evaluation (CNN/DM, DialogSum, IWSLT, MT-Bench). While standard splits may be implied for these benchmarks, they are not detailed.
Hardware Specification | Yes | GPU Latency. Following Any-Precision LLM (Park et al., 2024), we evaluated the latencies of linear layers in the LLMs across different Nvidia GPUs, including RTX 4090 and A40. ... To estimate the processing speed of PMPD when deployed on an NPU, we developed an analytical performance model of the hardware architecture of FlightLLM (Zeng et al., 2024)... For our experiments, we instantiate two NPU configurations consisting of: 4K and 16K MAC units with 1 GHz clock frequency (i.e. 8 and 16 teraops/sec (TOPS) peak throughput, respectively) for the deployment of smaller- (MobileLLaMA-1.4B, Phi-1.5) and larger-scale LLMs (Vicuna-7B), respectively, and with 32 GB/s off-chip memory bandwidth.
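An analytical NPU performance model of the kind cited here is typically roofline-style: per-layer latency is bounded by either peak compute or the time to stream the quantized weights over the off-chip memory bandwidth. The sketch below is an assumption about the model's general shape, not the paper's actual FlightLLM-based model; the function name and the single-token GEMV framing are illustrative.

```python
def linear_layer_latency_s(in_features, out_features, bits_per_weight,
                           peak_tops, mem_bw_gbs):
    """Roofline-style latency estimate for one single-token linear layer.

    Latency is the max of compute time (MACs vs. peak TOPS, counting one
    MAC as two ops) and weight-fetch time (quantized weight bytes vs.
    off-chip bandwidth). Low-bit quantization shrinks the memory term,
    which dominates for memory-bound decode-phase GEMVs.
    """
    macs = in_features * out_features
    compute_s = 2 * macs / (peak_tops * 1e12)
    weight_bytes = macs * bits_per_weight / 8
    memory_s = weight_bytes / (mem_bw_gbs * 1e9)
    return max(compute_s, memory_s)
```

Under the 8-TOPS / 32 GB/s configuration quoted above, a 4096x4096 layer is memory-bound, so moving from 4-bit to 3-bit weights cuts the estimated latency by the bit-width ratio (4/3), which is the mechanism behind PMPD's speedups from precision lowering.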
Software Dependencies | No | As our implementation was based on PyTorch (Paszke et al., 2019), we employed its built-in Profiler tool and reported the average self CUDA time metric over 100 forward passes to ensure reliable results. The paper mentions PyTorch but does not specify its version, nor the versions of any other key software components.
Experiment Setup | Yes | The high-precision variant of each model is quantized to the lowest lossless precision determined by perplexity on the C4 dataset, while low precision is defined to be one bit lower than the high precision. ... The Static scheduler finds a schedule that minimizes high-precision steps on each benchmark's validation set while maintaining lossless performance. ... The Learned scheduler was trained using the first 256 samples from the C4 test dataset as the seed dataset. ... During training, we use a cross-entropy loss function as the objective. ... We employed its built-in Profiler tool and reported the average self CUDA time metric over 100 forward passes to ensure reliable results. ... For our experiments, we instantiate two NPU configurations consisting of: 4K and 16K MAC units with 1 GHz clock frequency (i.e. 8 and 16 teraops/sec (TOPS) peak throughput, respectively) for the deployment of smaller- (MobileLLaMA-1.4B, Phi-1.5) and larger-scale LLMs (Vicuna-7B), respectively, and with 32 GB/s off-chip memory bandwidth.
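The Static scheduler objective quoted above (minimize high-precision steps subject to lossless validation quality) suggests a simple one-dimensional search. The sketch below is a hedged illustration of that search, assuming quality is monotone in the number of high-precision steps; `eval_quality`, `baseline_quality`, and `tol` are hypothetical names, not the paper's API.

```python
def find_static_schedule(candidate_steps, eval_quality, baseline_quality,
                         tol=0.0):
    """Return the smallest number of high-precision decoding steps whose
    validation-set quality stays within `tol` of the full-precision
    baseline; fall back to the most conservative candidate otherwise.

    `eval_quality(k)` stands in for running the benchmark's validation
    set with k high-precision steps before switching to low precision.
    """
    for k in sorted(candidate_steps):
        if eval_quality(k) >= baseline_quality - tol:
            return k
    return max(candidate_steps)
```

Because quality is assumed non-decreasing in `k`, the first candidate that clears the threshold is the minimizer; a binary search over `k` would also work under the same monotonicity assumption.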