Differentiable Model Compression via Pseudo Quantization Noise
Authors: Alexandre Défossez, Yossi Adi, Gabriel Synnaeve
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally verify that our method is competitive with STE-based quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the ImageNet dataset, DiffQ compresses a 12-layer transformer-based model by more than a factor of 8 (lower than 4 bits of precision per weight on average), with a loss of 0.3% in model accuracy. Code is available at github.com/facebookresearch/diffq. |
| Researcher Affiliation | Industry | Alexandre Défossez EMAIL Meta AI, FAIR Team, Paris, France Yossi Adi EMAIL Meta AI, FAIR Team, Tel-Aviv, Israel Gabriel Synnaeve EMAIL Meta AI, FAIR Team, Paris, France |
| Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Both experimental code and a generic framework usable with any architecture in just a few lines are available on our GitHub: github.com/facebookresearch/diffq. |
| Open Datasets | Yes | For instance, on the ImageNet dataset (Deng et al., 2009)... We trained a 16-layer transformer (Vaswani et al., 2017) based language model on the Wikitext-103 text corpus (Merity et al., 2016)... The model is trained on the standard MusDB benchmark (Rafii et al., 2017)... We evaluated three image classification benchmarks: ImageNet (Deng et al., 2009), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | We trained a 16-layer transformer (Vaswani et al., 2017) based language model on the Wikitext-103 text corpus (Merity et al., 2016)... The model is trained on the standard MusDB benchmark (Rafii et al., 2017)... ImageNet results are reported using EfficientNet-B3 (Tan & Le, 2019) and DeiT-B (Touvron et al., 2020) models. |
| Hardware Specification | Yes | At evaluation time, decompressing the Demucs model from its variable-bitwidth compact representation takes around 2.81 seconds on a MacBook Pro with a 2.4 GHz 8-core Intel i9 processor. |
| Software Dependencies | No | The paper mentions using PyTorch native support (Paszke et al., 2019), the Fairseq framework (Ott et al., 2019), and the zlib library. However, it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | All hyper-parameters for optimization and model definition are detailed in the Appendix. The trainable parameter l is initialized so that b = binit. We set binit = 8. We compare to the Quant-Noise method by Fan et al. (2021), but use a reduced layer-drop (Fan et al., 2019) of 0.1 instead of 0.2. We use the Demucs architecture by Défossez et al. (2019) with 64 initial hidden channels. The model is trained on the standard MusDB benchmark (Rafii et al., 2017) for 180 epochs. DiffQ (λ=5, g=16), DiffQ (λ=10, g=16), DiffQ (λ=3e-4), DiffQ (λ=1e-2), and DiffQ (λ=0.1) are specific hyperparameter settings. We additionally evaluate the effect of the group-size, g, on model size and accuracy, by optimizing DiffQ models using g ∈ {1, 4, 8, ∞}. |
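The setup row above quotes DiffQ's bitwidth and penalty hyperparameters without showing the mechanism they control. As a rough, hedged illustration (not the authors' implementation — see their GitHub repository for the real, PyTorch-based one), the core idea of pseudo quantization noise is to replace hard rounding during training with additive uniform noise whose magnitude matches the quantization step implied by a bitwidth `b`, so that `b` can be treated as a continuous, trainable parameter. The function name and NumPy formulation below are illustrative assumptions:

```python
import numpy as np

def pseudo_quant_noise(w, bits, rng):
    """Illustrative sketch: perturb weights `w` with uniform noise
    matching the rounding error of `bits`-bit uniform quantization."""
    # Quantization step over the weight range for 2**bits levels.
    scale = (w.max() - w.min()) / (2 ** bits - 1)
    # Uniform noise in [-scale/2, scale/2] mimics the rounding error
    # of true quantization while remaining a smooth function of `bits`.
    noise = rng.uniform(-0.5, 0.5, size=w.shape) * scale
    return w + noise

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
w8 = pseudo_quant_noise(w, 8, rng)  # small perturbation at 8 bits
w2 = pseudo_quant_noise(w, 2, rng)  # much larger perturbation at 2 bits
```

Because the noise scale is differentiable with respect to the bitwidth, a model-size penalty weighted by λ (as in the table's settings) can trade accuracy against average bits per weight during training, per group of `g` weights.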