Image Compression with Product Quantized Masked Image Modeling
Authors: Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first present our experimental setup in Section 4.1 and then present our results in Section 4.2. We provide ablation studies in Section 4.3. We will share our code and models. |
| Researcher Affiliation | Collaboration | Alaaeldin El-Nouby, Matthew Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou — Meta AI (FAIR Team), Paris; ENS, PSL University; INRIA, Paris |
| Pseudocode | No | The paper describes methods in prose and with figures, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | We will share our code and models. |
| Open Datasets | Yes | We train our models using ImageNet (Deng et al., 2009). For data augmentation, we apply random resized cropping to 256×256 images and horizontal flipping. For evaluation and comparison to prior work, we use the Kodak (Kodak, 1993) and Tecnick (Asuni & Giachetti, 2014) datasets for PSNR and MS-SSIM. Moreover, we compute the perceptual metrics (FID (Heusel et al., 2017), KID (Bińkowski et al., 2018)) for perceptually trained models using the CLIC 2020 test set (Toderici et al., 2020) (428 images) with the same patch cropping scheme detailed by Mentzer et al. (2020). |
| Dataset Splits | Yes | We train our model using ImageNet (Deng et al., 2009) for 50 epochs with a batch size of 256. For evaluation and comparison to prior work, we use the Kodak (Kodak, 1993) and Tecnick (Asuni & Giachetti, 2014) datasets for PSNR and MS-SSIM. Moreover, we compute the perceptual metrics (FID (Heusel et al., 2017), KID (Bińkowski et al., 2018)) for perceptually trained models using the CLIC 2020 test set (Toderici et al., 2020) (428 images) with the same patch cropping scheme detailed by Mentzer et al. (2020). |
| Hardware Specification | Yes | In contrast to PQ-MIM, raster-scan models require causal attention, which makes XCiT not a good fit. We use a standard ViT model instead. However, due to the quadratic complexity of ViT and the high resolution of images typically used for evaluation of compression methods (e.g. Tecnick), our autoregressive variant consistently exceeded the memory limits, even when using A100 GPUs with 40GB memory. |
| Software Dependencies | No | For entropy coding, we use the implementation of the torchac arithmetic coder (https://github.com/fab-jul/torchac). |
| Experiment Setup | Yes | For all our experiments we fix the codebook size V = 256 and only vary the number of sub-vectors M ∈ {2, 4, 6} for two different down-sampling factors f ∈ {8, 16}. Our PQ-VAE training uses the straight-through estimator (Bengio et al., 2013) to propagate gradients through the quantization bottleneck. We train our model using ImageNet (Deng et al., 2009) for 50 epochs with a batch size of 256. We use an AdamW (Loshchilov & Hutter, 2019) optimizer with a peak learning rate of 1×10⁻³, weight decay of 0.02 and β₂ = 0.95. We apply a linear warmup for the first 5 epochs of training followed by a cosine decay schedule for the remaining 45 epochs to a minimum learning rate of 5×10⁻⁵. Unless mentioned otherwise, for all experiments, the encoder and decoder use an XCiT-L6 with 6 layers and hidden dimension of 768. We use sinusoidal positional embedding (Vaswani et al., 2017) such that our model can flexibly operate on variable-sized images. As for the models trained with perceptual objectives (Figure 6 and Table 1), they are trained with a weighted sum of MSE, LPIPS (weight 1) and adversarial loss (weight 0.1). We use a Projected GAN discriminator (Sauer et al., 2021) architecture. The perceptual training is initialized with an MSE-only trained checkpoint and trained for 50 epochs using ImageNet with a learning rate of 10⁻⁴ and weight decay of 5×10⁻⁵. Additionally, we find that clipping the gradient norm to a maximum value of 4.0 improves the training stability. Our MIM module is an XCiT-L12 with 12 layers and embedding dimension of 768. By default we use S = 5 stages. All stages are processed with the same MIM model. The masked patches are replaced by a learnable mask token embedding. The loss is computed only for the masked patches. |
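
The product quantization bottleneck described in the setup above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `pq_quantize` is a hypothetical helper that splits each latent vector into M sub-vectors and assigns each one to its nearest codeword in a per-sub-vector codebook of size V (the paper's V = 256, M ∈ {2, 4, 6}).

```python
import numpy as np

def pq_quantize(z, codebooks):
    """Product quantization sketch (hypothetical helper, not the paper's code).

    z:         (N, M*d) latent vectors
    codebooks: (M, V, d) one codebook of V codewords per sub-vector
    Returns integer codes (N, M) and the quantized vectors (N, M*d).
    """
    M, V, d = codebooks.shape
    N = z.shape[0]
    sub = z.reshape(N, M, d)                              # split into M sub-vectors
    # squared distance from every sub-vector to every codeword: (N, M, V)
    dists = ((sub[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    codes = dists.argmin(-1)                              # (N, M) nearest-codeword indices
    quantized = codebooks[np.arange(M)[None, :], codes]   # gather codewords: (N, M, d)
    return codes, quantized.reshape(N, M * d)

# During training, the straight-through estimator passes gradients through the
# non-differentiable argmin by using z_q = z + stop_gradient(quantize(z) - z),
# so the decoder sees quantized values while the encoder receives identity gradients.
```

Each image thus compresses to N·M integer codes of log₂V = 8 bits each (before the MIM-based entropy coding), which is where the rate savings of the conditional entropy model come from.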