Amdahl’s Law for LLMs: A Throughput-Centric Analysis of Extreme LLM Quantization

Authors: Jinendra Malekar, Ramtin Zand

TMLR 2025

Reproducibility Variable | Result | Evidence (LLM Response)
Research Type | Experimental | Through extensive experiments across diverse model architectures and hardware platforms, we highlight key trade-offs and performance ceilings, providing a roadmap for future research aimed at maximizing LLM throughput through more holistic quantization strategies.
Researcher Affiliation | Academia | Jinendra Malekar (EMAIL), Department of Computer Science and Engineering, University of South Carolina; Ramtin Zand (EMAIL), Department of Computer Science and Engineering, University of South Carolina
Pseudocode | No | The paper describes mathematical formulations and architectural diagrams (e.g., Figure 1, Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using the "cycle-accurate SCALE-Sim framework (Samajdar et al., 2018; 2020)" and notes that "The Scale-Sim V2 tool was installed", indicating reliance on a third-party tool. However, the authors do not state that they are releasing their own code for the methodology described in the paper, and no repository link is provided.
Open Datasets | Yes | Generative Large language models (LLMs) such as GPT (Radford et al., 2019), OPT (Zhang et al., 2022) and LLaMA (Touvron et al., 2023) have attracted significant attention in recent years because of their impressive performance across various tasks...
Dataset Splits | No | The paper analyzes the performance of existing LLM architectures (GPT, OPT, LLaMA) under different quantization schemes and hardware configurations. Its experimental setup does not involve training new models or splitting datasets into training, validation, or test sets; instead, it evaluates the models themselves.
Hardware Specification | Yes | For the hardware, we designed two TPUs tailored for different applications: cloud and edge processing. The cloud TPU features a 256×256 systolic array with 16MB of SRAM, while the edge TPU has a 32×32 systolic array with 8MB of SRAM... All experiments were conducted on a computing setup with a single node, 48 cores, and 200 GB of memory... All experiments were conducted on a computing setup with a single node, 40 cores, NVIDIA Tesla V100 with 32 GB of RAM.
Software Dependencies | Yes | We utilize the cycle-accurate SCALE-Sim framework (Samajdar et al., 2018; 2020) to measure compute cycles and memory accesses in various LLMs. The Scale-Sim V2 tool was installed, and experiments were executed using specific configurations for both cloud and edge setups.
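The edge-TPU setup described in the paper maps naturally onto a SCALE-Sim V2 configuration file. The sketch below is illustrative, not the authors' actual config: the key names follow SCALE-Sim V2's scale.cfg convention, and the split of the 8MB SRAM across the ifmap/filter/ofmap buffers (leaving 2MB for internal use, as the paper states) is an assumption, since the paper's Table 3 breakdown is not reproduced here.

```ini
; Hypothetical SCALE-Sim V2 config for the paper's edge TPU
; (32x32 systolic array, 8MB SRAM, OS dataflow). Buffer sizes are
; an assumed split of the 8MB budget, not the paper's Table 3 values.
[general]
run_name = edge_tpu_os

[architecture_presets]
ArrayHeight:    32
ArrayWidth:     32
IfmapSramSzkB:  2048
FilterSramSzkB: 2048
OfmapSramSzkB:  2048
IfmapOffset:    0
FilterOffset:   10000000
OfmapOffset:    20000000
Dataflow:       os
Bandwidth:      10
MemoryBanks:    1

[run_presets]
InterfaceBandwidth: CALC
```

The three 2048kB buffers sum to 6MB, which together with the 2MB internal allocation matches the 8MB SRAM total the paper reports for the edge design.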
Experiment Setup | Yes | For our experiments, we study 13 different LLMs including GPT, OPT, and LLaMA models. Our work assumes W1A8, W2A8 quantization as the baseline quantization schemes throughout the analysis. Specifically, projection weights are represented using binary or ternary quantization (i.e., 1-2 bits for weights), while activations are maintained at 8-bit integer precision (INT8). Table 2 lists all models and their corresponding hyperparameters. ... For the hardware, we designed two TPUs tailored for different applications: cloud and edge processing. The cloud TPU features a 256×256 systolic array with 16MB of SRAM, while the edge TPU has a 32×32 systolic array with 8MB of SRAM. Both systolic arrays employ an OS dataflow. Also, in both designs, 2MB of memory is allocated for internal use... Table 3 provides the memory distribution of both edge and cloud TPU designs.
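As a back-of-the-envelope check on why 1-2-bit projection weights matter for throughput, the sketch below compares projection-weight storage under FP16 and the paper's W2A8 scheme. The model dimensions are hypothetical (chosen to resemble a ~1.3B-parameter decoder), not taken from the paper's Table 2.

```python
# Sketch: projection-weight memory under FP16 vs. W2A8 (2-bit weights).
# Dimensions below are hypothetical, not the paper's Table 2 values.
def weight_bytes(n_params: int, bits: int) -> float:
    """Storage in bytes for n_params weights at the given bit width."""
    return n_params * bits / 8

d_model = 2048
n_layers = 24
# Per layer: four attention projections (Q, K, V, O) plus two FFN
# matrices with a 4x hidden expansion.
proj_params = n_layers * (4 * d_model * d_model + 2 * d_model * 4 * d_model)

fp16 = weight_bytes(proj_params, 16)
w2 = weight_bytes(proj_params, 2)
print(f"FP16: {fp16 / 2**20:.1f} MiB, W2: {w2 / 2**20:.1f} MiB, "
      f"ratio: {fp16 / w2:.0f}x")
# prints: FP16: 2304.0 MiB, W2: 288.0 MiB, ratio: 8x
```

The 8x reduction in weight traffic is the kind of lever the paper's Amdahl-style analysis examines: it shrinks only the weight-bound fraction of inference, so end-to-end throughput gains are capped by the unquantized (e.g., activation and attention) portions.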