Quantization Aware Factorization for Deep Neural Network Compression
Authors: Daria Cherniuk, Stanislav Abukhovich, Anh-Huy Phan, Ivan Oseledets, Andrzej Cichocki, Julia Gusak
JAIR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compress neural network weights with a devised algorithm and evaluate its prediction quality and performance. We compare our approach to state-of-the-art post-training quantization methods and demonstrate competitive results and high flexibility in achieving a desirable quality-performance tradeoff. |
| Researcher Affiliation | Academia | Daria Cherniuk EMAIL Stanislav Abukhovich EMAIL Anh-Huy Phan EMAIL Ivan Oseledets EMAIL Andrzej Cichocki EMAIL Skolkovo Institute of Science and Technology, Russia Julia Gusak EMAIL Inria, University of Bordeaux, France |
| Pseudocode | Yes | Algorithm 1: Solve (9) using ADMM. Algorithm 2: Solve (15) using ADMM-EPC. |
| Open Source Code | Yes | Our source code and experiments are available as an open GitHub repository: https://github.com/KamikaziZen/admm-quantization |
| Open Datasets | Yes | We use a pretrained ResNet18 model from the torchvision model zoo (https://pytorch.org/vision/stable/models.html) and the ImageNet dataset (https://www.image-net.org/) to evaluate our method. |
| Dataset Splits | Yes | Batch Norm calibration. Since ResNet models have many Batch Norm layers, and both factorization and quantization disturb the distribution of outputs, calibrating Batch Norm layers' parameters (performing inference with no gradients or activations accumulation) can boost model prediction quality. The same procedure was used, for example, in AdaQuant (Hubara et al., 2020). In all our experiments we perform calibration on 2048 samples from the training dataset. This is the same number of samples as used in AdaRound (Nagel et al., 2020) and AdaQuant. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions target devices (mobile or embedded) but not the experimental setup hardware. |
| Software Dependencies | No | The paper mentions using a "pretrained ResNet18 model from the torchvision model zoo", which implies PyTorch and torchvision are used, but specific version numbers for these or any other software dependencies are not provided. |
| Experiment Setup | Yes | Metrics. In some ablation studies we compute a quantized reconstruction error metric: $e_{quant} = \|X_{(2)} - \hat{B}(\hat{C} \odot \hat{A})^T\|_F \,/\, \|X_{(2)}\|_F$ (17), where $\hat{A}, \hat{B}, \hat{C}$ are quantized factors of tensor $X$, and $X_{(2)}$ is its mode-2 matrix unfolding. We could use any other unfolding; the purpose of this notation is merely to address the fact that the Frobenius norm is formulated for matrices. Taking into consideration the benefits of reducing both MAC operations (factorization) and operands' bit-width (quantization), we adopt the BOP metric introduced in (van Baalen et al., 2020) and use it to compare our method with other approaches. For layer $l$, the BOP count is computed as $BOPs(l) = MACs(l) \cdot b_w \cdot b_a$ (18), where $b_w$ is the bit-width of weights and $b_a$ is the bit-width of (input) activations. While computing this metric for the whole model we take into consideration all layers, including Batch Norms. Factorization ranks. To choose a rank for each layer's factorization, we set up a parameter reduction rate (i.e. the ratio of the number of the factorized layer's parameters over that of the original layer) and define the rank as $N/((n+m) \cdot rate)$ for fully-connected layers and $N/((n+m+k) \cdot rate)$ for convolutions, where $N$ is the number of parameters in the original layer, $n$, $m$ and $k$ are the dimensions of the reshaped convolution weights (Section 2.1), and $rate$ denotes the parameter reduction rate. Batch Norm calibration. Since ResNet models have many Batch Norm layers, and both factorization and quantization disturb the distribution of outputs, calibrating Batch Norm layers' parameters (performing inference with no gradients or activations accumulation) can boost model prediction quality. The same procedure was used, for example, in AdaQuant (Hubara et al., 2020). In all our experiments we perform calibration on 2048 samples from the training dataset. |
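The metrics and rank-selection rule quoted above are simple enough to sketch in code. The following is a minimal illustration, not the paper's implementation: function names (`bop_count`, `factorization_rank`, `quant_reconstruction_error`) are our own, and the quantized factors are assumed given (the paper obtains them via ADMM, which is not shown here). The mode-2 unfolding of a CP-decomposed tensor satisfies $X_{(2)} \approx B(C \odot A)^T$, with $\odot$ the Khatri-Rao product.

```python
import numpy as np

def bop_count(macs, bw, ba):
    """BOPs(l) = MACs(l) * b_w * b_a  (Eq. 18)."""
    return macs * bw * ba

def factorization_rank(num_params, dims, rate):
    """Rank for a target parameter reduction rate.

    dims is (n, m) for fully-connected layers or (n, m, k) for
    convolutions (dimensions of the reshaped weight tensor);
    rate > 1 shrinks the factorized layer by roughly that factor.
    """
    return max(1, int(num_params / (sum(dims) * rate)))

def khatri_rao(C, A):
    """Column-wise Kronecker (Khatri-Rao) product of C (k x r) and A (n x r)."""
    k, r = C.shape
    n = A.shape[0]
    return np.einsum('ir,jr->ijr', C, A).reshape(k * n, r)

def quant_reconstruction_error(X2, A_hat, B_hat, C_hat):
    """e_quant = ||X_(2) - B_hat (C_hat ⊙ A_hat)^T||_F / ||X_(2)||_F  (Eq. 17)."""
    recon = B_hat @ khatri_rao(C_hat, A_hat).T
    return np.linalg.norm(X2 - recon) / np.linalg.norm(X2)
```

For example, an 8-bit weight / 8-bit activation layer with 1000 MACs costs 64000 BOPs, and a 512×1024 fully-connected layer at reduction rate 2 gets rank ⌊524288 / (1536 · 2)⌋ = 170.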