Learning to Quantize for Training Vector-Quantized Networks

Authors: Peijia Qin, Jianguo Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we first evaluate MQ on image generative modeling, including image reconstruction and image generation, using the VQVAE architecture (van den Oord et al., 2017). The experiments are done on small-scale datasets: CIFAR10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015). We then scale up to a larger experimental setting on FFHQ (Karras et al., 2019) and ImageNet (Deng et al., 2009) with VQGAN (Esser et al., 2021), which involves perceptual loss and adversarial loss as task losses. We refer to the resulting methods combined with MQ as MQVAE and MQGAN, respectively. For the bi-level optimization algorithm implementation, our code is mainly based on the Betty library (Choe et al., 2023). We release our code at GitHub for future research.
Researcher Affiliation Academia 1Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China. 2Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China, and also with the Pengcheng Laboratory. Correspondence to: Jianguo Zhang <EMAIL>.
Pseudocode Yes Algorithm 1 Meta-Quantization
Open Source Code Yes For the bi-level optimization algorithm implementation, our code is mainly based on the Betty library (Choe et al., 2023). We release our code at GitHub for future research: https://github.com/t2ance/MQVAE
Open Datasets Yes In this section, we first evaluate MQ on image generative modeling, including image reconstruction and image generation, using the VQVAE architecture (van den Oord et al., 2017). The experiments are done on small-scale datasets: CIFAR10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015). We then scale up to a larger experimental setting on FFHQ (Karras et al., 2019) and ImageNet (Deng et al., 2009) with VQGAN (Esser et al., 2021)
Dataset Splits Yes For CelebA, images undergo random cropping to 140×140 pixels, followed by resizing the smaller dimension to 128 while maintaining aspect ratio. No additional augmentations are used for CIFAR10. [...] For ImageNet-1K, input images were processed at 128×128 pixels. This involved resizing the image's smaller dimension to 128 while maintaining aspect ratio, followed by a 128×128 random crop and a 50% probability horizontal flip. FFHQ images were processed at 256×256 pixels directly. [...] Codebook usage is determined by the fraction of codewords utilized at least once when encoding the validation set, following (Mentzer et al., 2024) and (Zhu et al., 2024b). Additionally, generation FID is reported for the second stage involving a trained transformer, obtained by decoding representations sampled (potentially class-conditionally) with the transformer.
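The codebook-usage metric quoted above (fraction of codewords used at least once when encoding the validation set) is simple enough to sketch directly. The function below is an illustrative sketch, not the authors' code; the names `codebook_usage`, `encoded_indices`, and `codebook_size` are assumptions:

```python
def codebook_usage(encoded_indices, codebook_size):
    """Fraction of codewords used at least once when encoding
    the validation set (the usage metric described in the paper).

    encoded_indices: iterable of integer code indices produced by
        the quantizer over the whole validation set.
    codebook_size: total number of entries in the codebook.
    """
    used = set(encoded_indices)  # distinct codewords that fired
    return len(used) / codebook_size
```

For example, if a 1024-entry codebook only ever emits 512 distinct indices on the validation set, usage is 0.5, signaling codebook collapse of half the entries.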
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments. It only mentions memory usage in general terms.
Software Dependencies No The paper mentions software like the 'Betty library', 'PyTorch', and the 'Adam optimizer' but does not specify any version numbers for these components, which is necessary for reproducible dependency information.
Experiment Setup Yes The backbone architecture of the autoencoder for all compared methods follows Takida et al. (2022) with 64 channels. A codebook size of 1024 is used across all methods. For the hyper-net configuration, an MLP is used. It first lifts the 32-dimensional learned embedding to 256 dimensions, then projects it back to 32 dimensions to form the codebook entries for quantization, using Tanh as the activation function. Models are trained using Mean Squared Error (MSE) as the reconstruction loss; no perceptual or discriminative losses are used. We employ the Adam optimizer (Kingma & Ba, 2015) with momentum set to (0.9, 0.95) and an initial learning rate of 1e-4. The learning rate follows a linear warmup for the first 10% of epochs, then a half-cycle cosine decay. No weight decay is applied to the quantizer. Training runs for a maximum of 90 epochs, with early stopping if performance saturates. [...] The number of codes is set to 65,536 for ImageNet and 16,384 for FFHQ, consistent with comparison methods. For initialization, codebook embeddings are initialized using a simple Gaussian distribution, avoiding the need for a pre-trained model as in Zhu et al. (2024a). The hyper-net transformation uses PyTorch's default initialization. We train for 20 epochs on ImageNet-1K and 800 epochs on FFHQ, employing early stopping if performance saturates. The Adam optimizer (Kingma & Ba, 2015) with optimizer momentum of (0.5, 0.9) is used with an initial learning rate of 1e-4. The learning rate undergoes a linear warmup for the first 10% of epochs, followed by a half-cycle cosine decay schedule.
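The quoted schedule (linear warmup over the first 10% of training, then half-cycle cosine decay from the base learning rate of 1e-4) can be sketched as a plain function of the training step. This is a minimal sketch under assumed conventions (decay to zero, per-step granularity); the function name and signature are illustrative, not from the paper:

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-4, warmup_frac=0.1):
    """Linear warmup for the first `warmup_frac` of training,
    then half-cycle cosine decay toward zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linearly ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Half-cycle cosine: progress goes 0 -> 1 over the decay phase.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this would typically be wired up via `torch.optim.lr_scheduler.LambdaLR` with a multiplier of `lr_schedule(step, total_steps) / base_lr`.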