BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
Authors: Wonsuk Jang, Thierry Tambe
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. We evaluate BlockDialect on three LLMs: LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (Dubey et al., 2024), and Mistral-7B (Jiang et al., 2023). The evaluation includes seven zero-shot commonsense reasoning tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), ARC-easy, and ARC-challenge (Clark et al., 2018). We leverage the lm-eval-harness (Gao et al., 2023) framework, with 0-shot notation representing the average accuracy across the seven tasks. Additionally, we report perplexity scores on WikiText2 (Merity et al., 2016) with a chunk size of 2048. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, Stanford University, CA, USA. |
| Pseudocode | No | The paper describes methods and processes (e.g., 'Two-Stage Dialect Selection Process' and 'How Should Online Quantization and MAC Operations be Performed?' with figures), but it does not include a distinct, structured pseudocode block or algorithm section. |
| Open Source Code | Yes | For performance evaluation, we implement the BlockDialect emulation framework on top of Hugging Face Transformers using PyTorch. All experiments were conducted on a single NVIDIA H100 GPU. https://code.stanford.edu/tambe-lab/blockdialect |
| Open Datasets | Yes | We evaluate BlockDialect on three LLMs: LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (Dubey et al., 2024), and Mistral-7B (Jiang et al., 2023). The evaluation includes seven zero-shot commonsense reasoning tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), ARC-easy, and ARC-challenge (Clark et al., 2018). We leverage the lm-eval-harness (Gao et al., 2023) framework, with 0-shot notation representing the average accuracy across the seven tasks. Additionally, we report perplexity scores on WikiText2 (Merity et al., 2016) with a chunk size of 2048. |
| Dataset Splits | No | The paper mentions '0-shot notation representing the average accuracy across seven tasks' and 'perplexity scores on WikiText2 (Merity et al., 2016) with a chunk of 2048.' While this indicates evaluation methods, it does not explicitly provide details about training, validation, or test dataset splits (e.g., specific percentages, sample counts, or predefined split references). |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA H100 GPU. For hardware comparison, we model multiply-accumulate (MAC) units for various precision levels using SystemVerilog and synthesize them with Synopsys Design Compiler. The synthesis is performed at 0.5 GHz using the Nangate 45nm Open Cell Library to estimate area and power. Each MAC unit is sized to iteratively add 64 terms from a dot product. For additional prototype hardware cost analysis, we synthesize the design using the SkyWater 130nm standard cell library, targeting a clock frequency of 100 MHz. |
| Software Dependencies | No | The paper mentions 'Hugging Face Transformers using PyTorch', the 'lm-eval-harness (Gao et al., 2023) framework', 'SystemVerilog', 'Synopsys Design Compiler', the 'Nangate 45nm Open Cell Library', and the 'SkyWater 130nm standard cell library', but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We evaluate BlockDialect on three LLMs: LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (Dubey et al., 2024), and Mistral-7B (Jiang et al., 2023). The evaluation includes seven zero-shot commonsense reasoning tasks... we report perplexity scores on WikiText2 (Merity et al., 2016) with a chunk size of 2048. Block size is 32 unless otherwise specified. We set (search interval, search round) to (60, 2) in LLM-FP4 to avoid excessive calibration time, observing negligible LLaMA-7B accuracy loss compared to the original paper. We experiment with various migration strengths (α), controlling the aggressiveness of this shift with a granularity of 0.05, and select the most effective one with the lowest perplexity. |
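The setup above quantizes tensors in blocks of 32 values, with each block sharing its own scale. As a rough illustration of that block-level scaling idea only, the sketch below quantizes a 1-D array to a symmetric 4-bit integer grid per block; it is a hypothetical minimal example and does not model BlockDialect's per-block mixed-format ("dialect") selection or its migration-strength (α) search.

```python
import numpy as np

def blockwise_quantize(x, block_size=32, n_bits=4):
    """Hypothetical sketch of block-wise quantization: each contiguous
    block of `block_size` values shares one scale derived from its
    absolute maximum, and values are rounded to a symmetric integer
    grid of 2**(n_bits-1)-1 levels per sign. Returns the dequantized
    (reconstructed) values."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size            # pad so length divides evenly
    xp = np.pad(x, (0, pad))
    blocks = xp.reshape(-1, block_size)
    qmax = 2 ** (n_bits - 1) - 1            # e.g. 7 for 4-bit symmetric
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0               # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax, qmax)
    return (q * scales).reshape(-1)[:len(x)]
```

With block size 32 the per-block rounding error is bounded by half the block scale, which is why per-block (rather than per-tensor) scaling helps when magnitudes vary across blocks.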