Layer-wise Quantization for Quantized Optimistic Dual Averaging

Authors: Anh Duc Nguyen, Ilia Markov, Zhengqing Wu, Ali Ramezani-Kebrya, Kimon Antonakopoulos, Dan Alistarh, Volkan Cevher

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically show that QODA achieves up to a 150% speedup over the baselines in end-to-end training time for training Wasserstein GAN on 12+ GPUs. ... In Section 7, we provide empirical studies on GANs and Transformer-XL."
Researcher Affiliation | Collaboration | (1) National University of Singapore (NUS)... (2) Neural Magic; (3) Laboratory for Information and Inference Systems (LIONS), École Polytechnique Fédérale de Lausanne (EPFL); (4) University of Oslo (UiO)... (7) Institute of Science and Technology Austria (ISTA).
Pseudocode | Yes | Algorithm 1: Quantized Optimistic Dual Averaging (QODA)
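For intuition on the method named in Algorithm 1: optimistic dual averaging accumulates past gradients in a dual variable and reuses the most recent gradient as an optimistic guess for the next one. Below is a minimal single-node Euclidean sketch only; the paper's Algorithm 1 additionally quantizes the gradients communicated between workers, and the names `grad_fn`, `lr`, and `steps` are hypothetical, not taken from the paper.

```python
import numpy as np

def optimistic_dual_averaging(grad_fn, x0, steps=100, lr=0.1):
    """Sketch of optimistic dual averaging (Euclidean case, single node).

    grad_fn(x) returns a (stochastic) gradient or operator value at x.
    Illustrative simplification of the idea behind QODA's Algorithm 1:
    the actual method also applies layer-wise quantization to gradients.
    """
    x = x0.copy()
    g_sum = np.zeros_like(x0)   # dual accumulator of all past gradients
    g_prev = np.zeros_like(x0)  # last gradient, reused as the optimistic guess
    for _ in range(steps):
        # optimistic (leading) point: pretend g_prev will be seen once more
        x_lead = x0 - lr * (g_sum + g_prev)
        g = grad_fn(x_lead)
        g_sum += g
        g_prev = g
        # base iterate from the accumulated dual variable
        x = x0 - lr * g_sum
    return x
```

On a simple quadratic with gradient `2 * x`, the iterates contract toward the minimizer at the origin, illustrating the averaging-plus-optimism structure.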
Open Source Code | Yes | "We used the implementation of (Markov et al., 2024) and provide our code in the supplementary material."
Open Datasets | Yes | "We have implemented QODA in Algorithm 1... and train WGAN (Arjovsky et al., 2017) on CIFAR10 and CIFAR100 (Krizhevsky, 2009). ... training Transformer-XL on WikiText-103."
Dataset Splits | No | The paper mentions using well-known datasets like CIFAR10, CIFAR100, and WikiText-103 but does not explicitly state the training, validation, or test splits used for these datasets. It refers to a "training recipe" and "hyperparameters as in the original codebase" but lacks specific split percentages or methodologies.
Hardware Specification | Yes | "In our experiments, we use 4 to 16 nodes, each with a single NVIDIA RTX 3090 GPU, in a multi-node Genesis Cloud environment... We used 8 NVIDIA GH200 120GB GPUs for the experiments here."
Software Dependencies | No | "We use the torch_cgx PyTorch extension (Markov et al., 2022). Moreover, we adapt compression choices layer-wise, following the L-GreCo (Markov et al., 2024) algorithm. For the communication backend, we pick the best option for quantized and full-precision regimes: Open MPI (ope, 2023) and NCCL (ncc, 2023), respectively." The paper mentions software like PyTorch, Open MPI, and NCCL but does not provide specific version numbers for these components.
Experiment Setup | Yes | "We follow the training recipe of Q-GenX (Ramezani-Kebrya et al., 2023), where authors set large batch size (1024) and keep all other hyperparameters as in the original codebase of (Gidel et al., 2018). For global and layer-wise compression, we use 5 bits (with bucket size 128), and run the L-GreCo adaptive compression algorithm every 10K optimization steps for both the generator and discriminator models."
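The "5 bits (with bucket size 128)" setting refers to bucketed stochastic uniform quantization of gradients. The following is a QSGD-style sketch under those parameters, not the paper's exact quantizer (which L-GreCo tunes per layer); the function and argument names are hypothetical.

```python
import numpy as np

def quantize_bucketed(v, bits=5, bucket_size=128, rng=None):
    """Stochastic uniform quantization with per-bucket scaling (sketch).

    Each bucket of `bucket_size` entries is normalized by its max-abs
    value, then rounded stochastically (unbiasedly) to one of
    2**bits - 1 uniform levels. Illustrative only: the paper's
    layer-wise scheme adapts the bit width per layer via L-GreCo.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    levels = 2 ** bits - 1
    out = np.empty_like(v, dtype=np.float64)
    for start in range(0, len(v), bucket_size):
        b = v[start:start + bucket_size].astype(np.float64)
        scale = np.max(np.abs(b))
        if scale == 0.0:
            out[start:start + len(b)] = 0.0
            continue
        u = np.abs(b) / scale * levels          # in [0, levels]
        low = np.floor(u)
        # round up with probability equal to the fractional part (unbiased)
        q = low + (rng.random(b.shape) < (u - low))
        out[start:start + len(b)] = np.sign(b) * q / levels * scale
    return out
```

With 5 bits the per-element error is bounded by the bucket's max-abs value divided by 31 levels, which is why small buckets (here 128) keep the quantization error tight.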