BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
Authors: Wonsuk Jang, Thierry Tambe
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. We evaluate BlockDialect on three LLMs: LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (Dubey et al., 2024), and Mistral-7B (Jiang et al., 2023). The evaluation includes seven zero-shot commonsense reasoning tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), ARC-easy, and ARC-challenge (Clark et al., 2018). We leverage the lm-eval-harness (Gao et al., 2023) framework, with 0-shot notation representing the average accuracy across the seven tasks. Additionally, we report perplexity scores on WikiText2 (Merity et al., 2016) with a chunk size of 2048. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, Stanford University, CA, USA. |
| Pseudocode | No | The paper describes methods and processes (e.g., 'Two-Stage Dialect Selection Process' and 'How Should Online Quantization and MAC Operations be Performed?' with figures), but it does not include a distinct, structured pseudocode block or algorithm section. |
| Open Source Code | Yes | For performance evaluation, we implement the BlockDialect emulation framework on top of Hugging Face Transformers using PyTorch. All experiments were conducted on a single NVIDIA H100 GPU. https://code.stanford.edu/tambe-lab/blockdialect |
| Open Datasets | Yes | We evaluate BlockDialect on three LLMs: LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (Dubey et al., 2024), and Mistral-7B (Jiang et al., 2023). The evaluation includes seven zero-shot commonsense reasoning tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), ARC-easy, and ARC-challenge (Clark et al., 2018). We leverage the lm-eval-harness (Gao et al., 2023) framework, with 0-shot notation representing the average accuracy across the seven tasks. Additionally, we report perplexity scores on WikiText2 (Merity et al., 2016) with a chunk size of 2048. |
| Dataset Splits | No | The paper mentions '0-shot notation representing the average accuracy across seven tasks' and 'perplexity scores on WikiText2 (Merity et al., 2016) with a chunk of 2048.' While this indicates evaluation methods, it does not explicitly provide details about training, validation, or test dataset splits (e.g., specific percentages, sample counts, or predefined split references). |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA H100 GPU. For hardware comparison, we model multiply-accumulate (MAC) units for various precision levels using SystemVerilog and synthesize them with Synopsys Design Compiler. The synthesis is performed at 0.5 GHz using the Nangate 45nm Open Cell Library to estimate area and power. Each MAC unit is sized to iteratively add 64 terms from a dot product. For additional prototype hardware cost analysis, we synthesize the design using the SkyWater 130nm standard cell library, targeting a clock frequency of 100 MHz. |
| Software Dependencies | No | The paper mentions 'Hugging Face Transformers using PyTorch', the 'lm-eval-harness (Gao et al., 2023) framework', 'SystemVerilog', 'Synopsys Design Compiler', the 'Nangate 45nm Open Cell Library', and the 'SkyWater 130nm standard cell library', but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We evaluate BlockDialect on three LLMs: LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (Dubey et al., 2024), and Mistral-7B (Jiang et al., 2023). The evaluation includes seven zero-shot commonsense reasoning tasks... we report perplexity scores on WikiText2 (Merity et al., 2016) with a chunk size of 2048. Block size is 32 unless otherwise specified. We set (search interval, search round) to (60, 2) in LLM-FP4 to avoid excessive calibration time, observing negligible LLaMA-7B accuracy loss compared to the original paper. We experiment with various migration strengths (α), controlling the aggressiveness of this shift with a granularity of 0.05, and select the most effective one with the lowest perplexity. |
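The setup above quantizes tensors in blocks of 32 values, with each block sharing its own scale. As a rough illustration of that block-level scaling idea only, the sketch below quantizes a 1-D array to a symmetric 4-bit integer grid per block; it is a hypothetical minimal example and does not model BlockDialect's per-block mixed-format ("dialect") selection or its migration-strength (α) search.

```python
import numpy as np

def blockwise_quantize(x, block_size=32, n_bits=4):
    """Hypothetical sketch of block-wise quantization: each contiguous
    block of `block_size` values shares one scale derived from its
    absolute maximum, and values are rounded to a symmetric integer
    grid of 2**(n_bits-1)-1 levels per sign. Returns the dequantized
    (reconstructed) values."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size            # pad so length divides evenly
    xp = np.pad(x, (0, pad))
    blocks = xp.reshape(-1, block_size)
    qmax = 2 ** (n_bits - 1) - 1            # e.g. 7 for 4-bit symmetric
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0               # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax, qmax)
    return (q * scales).reshape(-1)[:len(x)]
```

With block size 32 the per-block rounding error is bounded by half the block scale, which is why per-block (rather than per-tensor) scaling helps when magnitudes vary across blocks.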