Image Compression with Product Quantized Masked Image Modeling
Authors: Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first present our experimental setup in Section 4.1 and then present our results in Section 4.2. We provide ablation studies in Section 4.3. We will share our code and models. |
| Researcher Affiliation | Collaboration | Alaaeldin El-Nouby, Matthew Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou — Meta AI (FAIR Team), Paris; ENS, PSL University; INRIA, Paris |
| Pseudocode | No | The paper describes methods in prose and with figures, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | We will share our code and models. |
| Open Datasets | Yes | We train our models using ImageNet (Deng et al., 2009). For data augmentation, we apply random resized cropping to 256×256 images and horizontal flipping. For evaluation and comparison to prior work, we use the Kodak (Kodak, 1993) and Tecnick (Asuni & Giachetti, 2014) datasets for PSNR and MS-SSIM. Moreover, we compute the perceptual metrics (FID (Heusel et al., 2017), KID (Bińkowski et al., 2018)) for perceptually trained models using the CLIC 2020 test set (Toderici et al., 2020) (428 images) with the same patch cropping scheme detailed by Mentzer et al. (2020). |
| Dataset Splits | Yes | We train our model using ImageNet (Deng et al., 2009) for 50 epochs with a batch size of 256. For evaluation and comparison to prior work, we use the Kodak (Kodak, 1993) and Tecnick (Asuni & Giachetti, 2014) datasets for PSNR and MS-SSIM. Moreover, we compute the perceptual metrics (FID (Heusel et al., 2017), KID (Bińkowski et al., 2018)) for perceptually trained models using the CLIC 2020 test set (Toderici et al., 2020) (428 images) with the same patch cropping scheme detailed by Mentzer et al. (2020). |
| Hardware Specification | Yes | In contrast to PQ-MIM, raster-scan models require causal attention, which makes XCiT not a good fit. We use a standard ViT model instead. However, due to the quadratic complexity of ViT and the high resolution of images typically used for evaluation of compression methods (e.g. Tecnick), our autoregressive variant consistently exceeded the memory limits, even when using A100 GPUs with 40GB memory. |
| Software Dependencies | No | For entropy coding, we use the implementation of the torchac arithmetic coder (https://github.com/fab-jul/torchac). |
| Experiment Setup | Yes | For all our experiments we fix the codebook size V = 256 and only vary the number of sub-vectors M ∈ {2, 4, 6} for two different down-sampling factors f ∈ {8, 16}. Our PQ-VAE training uses the straight-through estimator (Bengio et al., 2013) to propagate gradients through the quantization bottleneck. We train our model using ImageNet (Deng et al., 2009) for 50 epochs with a batch size of 256. We use an AdamW (Loshchilov & Hutter, 2019) optimizer with a peak learning rate of 1×10⁻³, weight decay of 0.02 and β₂ = 0.95. We apply a linear warmup for the first 5 epochs of training followed by a cosine decay schedule for the remaining 45 epochs to a minimum learning rate of 5×10⁻⁵. Unless mentioned otherwise, for all experiments, the encoder and decoder use an XCiT-L6 with 6 layers and hidden dimension of 768. We use sinusoidal positional embedding (Vaswani et al., 2017) such that our model can flexibly operate on variable-sized images. As for the models trained with perceptual objectives (Figure 6 and Table 1), they are trained with a weighted sum of MSE, LPIPS (weight 1) and adversarial loss (weight 0.1). We use a Projected GAN discriminator (Sauer et al., 2021) architecture. The perceptual training is initialized with an MSE-only trained checkpoint and trained for 50 epochs using ImageNet with a learning rate of 10⁻⁴ and weight decay of 5×10⁻⁵. Additionally, we find that clipping the gradient norm to a maximum value of 4.0 improves the training stability. Our MIM module is an XCiT-L12 with 12 layers and embedding dimension of 768. By default we use S = 5 stages. All stages are processed with the same MIM model. The masked patches are replaced by a learnable mask token embedding. The loss is computed only for the masked patches. |
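
The product quantization bottleneck described in the setup above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `pq_quantize` is a hypothetical helper that splits each latent vector into M sub-vectors and assigns each one to its nearest codeword in a per-sub-vector codebook of size V (the paper's V = 256, M ∈ {2, 4, 6}).

```python
import numpy as np

def pq_quantize(z, codebooks):
    """Product quantization sketch (hypothetical helper, not the paper's code).

    z:         (N, M*d) latent vectors
    codebooks: (M, V, d) one codebook of V codewords per sub-vector
    Returns integer codes (N, M) and the quantized vectors (N, M*d).
    """
    M, V, d = codebooks.shape
    N = z.shape[0]
    sub = z.reshape(N, M, d)                              # split into M sub-vectors
    # squared distance from every sub-vector to every codeword: (N, M, V)
    dists = ((sub[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    codes = dists.argmin(-1)                              # (N, M) nearest-codeword indices
    quantized = codebooks[np.arange(M)[None, :], codes]   # gather codewords: (N, M, d)
    return codes, quantized.reshape(N, M * d)

# During training, the straight-through estimator passes gradients through the
# non-differentiable argmin by using z_q = z + stop_gradient(quantize(z) - z),
# so the decoder sees quantized values while the encoder receives identity gradients.
```

Each image thus compresses to N·M integer codes of log₂V = 8 bits each (before the MIM-based entropy coding), which is where the rate savings of the conditional entropy model come from.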