CAT: Compression-Aware Training for bandwidth reduction

Authors: Chaim Baskin, Brian Chmiel, Evgenii Zheltonozhskii, Ron Banner, Alex M. Bronstein, Avi Mendelson

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CAT significantly improves the state-of-the-art results reported for quantization evaluated on various vision and NLP tasks, such as image classification (ImageNet), image detection (Pascal VOC), sentiment analysis (CoLA), and textual entailment (MNLI). For example, on ResNet-18, we achieve near-baseline ImageNet accuracy with an average representation of only 1.5 bits per value with 5-bit quantization. Moreover, we show that entropy reduction of weights and activations can be applied together, further improving bandwidth reduction. Reference implementation is available.
Researcher Affiliation | Collaboration | Chaim Baskin (EMAIL), Technion - Israel Institute of Technology, Haifa, Israel; Brian Chmiel (EMAIL), Habana Labs (an Intel company), Caesarea, Israel; Evgenii Zheltonozhskii (EMAIL), Technion - Israel Institute of Technology, Haifa, Israel
Pseudocode | No | The paper describes the proposed methods, such as the 'Differentiable Entropy-Reducing Loss' in Section 3.2, using mathematical formulations and descriptive text. However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Reference implementation is available.
Open Datasets | Yes | We evaluate the proposed scheme on common CNN architectures for image classification (ResNet-18/34/50, MobileNetV2) on ImageNet (Russakovsky et al., 2015), object detection (SSD512, Liu et al., 2016) on Pascal VOC (Everingham et al., 2010), as well as Transformers (BERT, Devlin et al., 2019) for sentiment analysis on CoLA (Warstadt et al., 2019) and for textual entailment on MNLI (Williams et al., 2018).
Dataset Splits | No | The paper mentions using the well-known ImageNet, Pascal VOC, CoLA, and MNLI datasets for evaluation. While it details training procedures and parameters, it does not explicitly specify training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for these datasets.
Hardware Specification | No | The paper mentions general hardware terms such as 'custom hardware' and 'single GPU', but it does not specify exact models of GPUs, CPUs, or other detailed processor/memory specifications used for running the experiments. For example, Section 4.1.4 states: 'We noticed that training ResNet-50 on a single GPU mandated the use of small batches, leading to performance degradation.'
Software Dependencies | No | The paper notes that 'Our code is based on an implementation by Li (2018)', which refers to a PyTorch implementation of SSD512. However, it does not provide version numbers for any software, libraries, or frameworks used in the methodology (e.g., PyTorch, Python, or CUDA versions).
Experiment Setup | Yes | For optimization, we used SGD with a learning rate of 10^-4, momentum 0.9, and weight decay 4·10^-5 for up to 30 epochs (usually, 10 to 15 epochs were sufficient for convergence). Our initial choice of temperature was T = 10, which performed well. We tried to apply exponential scheduling to the temperature (Jang et al., 2017), but it did not have any noticeable effect on the results.
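Since the paper gives the entropy-reducing loss of Section 3.2 only as equations, a minimal sketch of one common way to make entropy differentiable may help: values are softly assigned to quantization bins via a temperature-scaled softmax over negative distances, and the entropy of the mean bin probabilities is penalized. The function name `soft_entropy`, the NumPy implementation, and the specific distance-based assignment are my assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def soft_entropy(values, bin_centers, T=10.0):
    """Differentiable surrogate for the entropy of quantized values.

    Each value is softly assigned to every bin via a softmax over
    negative distances scaled by temperature T; the bin probabilities
    are the mean soft assignments, and their Shannon entropy (in bits)
    is returned. Larger T sharpens the assignment toward hard binning.
    """
    # Pairwise distances, shape (n_values, n_bins).
    d = np.abs(values[:, None] - bin_centers[None, :])
    logits = -T * d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    bin_probs = p.mean(axis=0)  # expected bin usage over all values
    eps = 1e-12
    return -np.sum(bin_probs * np.log2(bin_probs + eps))

bins = np.linspace(-1.0, 1.0, 32)  # 32 bins = 5-bit quantization grid

# Concentrated values use few bins -> lower soft entropy.
low = soft_entropy(np.full(1000, 0.01), bins)

# Values spread over the range use many bins -> entropy near log2(32) = 5 bits.
rng = np.random.default_rng(0)
high = soft_entropy(rng.uniform(-1.0, 1.0, 1000), bins)
```

Minimizing this quantity alongside the task loss pushes the value distribution toward fewer effective bins, which is the mechanism behind the "1.5 bits per value with 5-bit quantization" result quoted above.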
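The optimizer hyperparameters in the experiment-setup row can be sanity-checked with a minimal NumPy sketch of SGD with momentum and L2 weight decay, following `torch.optim.SGD` semantics (decay folded into the gradient, then momentum applied). The reported values are used (lr 10^-4, momentum 0.9, weight decay 4·10^-5); the toy quadratic objective is my own illustration, not an experiment from the paper.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=1e-4, momentum=0.9, weight_decay=4e-5):
    """One SGD update: add L2 weight-decay term to the gradient,
    accumulate momentum, then take a step scaled by the learning rate."""
    g = grad + weight_decay * w
    velocity = momentum * velocity + g
    return w - lr * velocity, velocity

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is simply w,
# so repeated steps should shrink the weights toward zero.
w = np.ones(4)
v = np.zeros(4)
for _ in range(1000):
    w, v = sgd_step(w, grad=w, velocity=v)
```

With momentum 0.9 the effective step is roughly lr / (1 - momentum) = 10^-3 per iteration, so 1000 steps shrink each weight by about a factor of e.
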