Instance-Adaptive Video Compression: Improving Neural Codecs by Training on the Test Set

Authors: Ties van Rozendaal, Johann Brehmer, Yunfan Zhang, Reza Pourreza, Auke J. Wiggers, Taco Cohen

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental On UVG, HEVC, and Xiph datasets, our codec improves the performance of a scale-space flow model by between 21% and 27% BD-rate savings, and that of a state-of-the-art B-frame model by 17% to 20% BD-rate savings. We also demonstrate that instance-adaptive finetuning improves the robustness to domain shift. Finally, our approach reduces the capacity requirements of compression models. We show that it enables competitive performance even after reducing the network size by 70%. Figure 1 shows the rate-distortion curves of our instance-adaptive video codec (InstA) as well as neural and traditional baselines. Both for SSF in the P-frame setting and B-EPIC in the B-frame setting, the instance-adaptive models clearly outperform the corresponding base models.
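The BD-rate savings quoted above are Bjøntegaard-delta rates, the standard summary of the average bitrate difference between two rate-distortion curves at equal quality. The following is a minimal sketch of that standard computation, not code from the paper; the function name and inputs are illustrative:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard-delta rate: average percent bitrate difference between
    two RD curves over their overlapping quality range (standard method)."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit cubic polynomials of log-rate as a function of quality (PSNR).
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate each fit over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0  # negative = bitrate savings
```

A test curve that reaches the same PSNR at half the anchor's bitrate yields a BD-rate of -50%.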
Researcher Affiliation Industry Ties van Rozendaal, Johann Brehmer, Yunfan Zhang, Reza Pourreza, Auke Wiggers, Taco S. Cohen; Qualcomm AI Research (an initiative of Qualcomm Technologies, Inc.).
Pseudocode No The paper describes the encoding procedure in a numbered list within Section 3.4, but it is presented as natural-language steps rather than structured pseudocode or an algorithm block. For example: "Our procedure follows van Rozendaal et al. (2021) and mainly differs in the choice of hyper-parameters and application to video auto-encoder models. For completeness we shall describe the full method here. A video sequence x is compressed by: 1. finetuning the model parameters (θ, φ) of the base model on the sequence x using Eq. (2), 2. computing the latent codes z ∼ qφ(z|x), 3. parameterizing the finetuned decoder and prior parameters as updates δ = θ − θ_D, 4. quantizing the latent codes z as well as the prior and decoder parameter updates δ, and 5. compressing the quantized latents z and updates δ with entropy coding to the bitstream."
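The five quoted steps could be rendered as code roughly as follows. This is a hypothetical sketch with toy NumPy stand-ins for the model, the finetuning objective, and the quantizers; none of these names come from the authors' implementation:

```python
import numpy as np

def encode_instance_adaptive(theta_base, video, finetune_steps=3, lr=0.1):
    # 1. Finetune the base parameters on this one sequence; a few toy
    #    gradient steps toward the video mean stand in for minimizing Eq. (2).
    theta = theta_base.copy()
    for _ in range(finetune_steps):
        theta -= lr * (theta - video.mean())

    # 2. Compute the latent codes z ~ q_phi(z|x); here a toy residual.
    z = video - theta

    # 3. Parameterize the finetuned parameters as updates from the base model.
    delta = theta - theta_base

    # 4. Quantize latents (bin width 1) and updates (fine fixed grid).
    z_bar = np.round(z)
    delta_bar = np.round(delta / 0.001) * 0.001

    # 5. Entropy coding would compress z_bar and delta_bar to the bitstream;
    #    here we simply return them.
    return z_bar, delta_bar
```

The point of step 3 is that only the small updates δ, not the full finetuned weights, need to be transmitted alongside the latents.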
Open Source Code No The paper states: "Finally, in the supplementary material we provide the performance of the various methods on each video sequence as a CSV file to aid comparisons." This refers to data, not source code. There is no explicit statement about releasing code or a link to a code repository for the methodology described.
Open Datasets Yes We use sequences from five different datasets. The global models are trained on Vimeo90k (Xue et al., 2019). We evaluate on the HEVC class-B test sequences (HEVC, 2013), on the UVG-1k (Mercat et al., 2020) dataset, and on Xiph-5N (van Rozendaal et al., 2021), which entails five Xiph.org test sequences. The performance on out-of-distribution data is tested on two sequences from the animated short film Big Buck Bunny, also part of the Xiph.org collection (Xiph.org).
Dataset Splits Yes The models are trained with a GoP size of 3 frames, which means that we split the training video into chunks of 3 frames and randomly sample chunks during training. We finally evaluate the models with a GoP size of 12. Further increasing the GoP size leads to diminishing returns in rate-distortion performance, as we demonstrate in Appendix D. For the B-EPIC model, we use the model trained by Pourreza & Cohen (2021) on Vimeo-90k. The setup is similar to that for the SSF models, except that B-EPIC's more complicated GoP structure requires training with a GoP size of 4 frames. At test time we use a GoP size of 12; the frame configurations are described in Appendix B. In the P-frame scenario we use a GoP size of 3 and finetune on full-resolution frames (1920 × 1080 pixels) with a batch size of 1 and a learning rate of 10⁻⁵. After finetuning, we transmit sequences with a GoP size of 12.
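The GoP-based chunking described above (split the training video into chunks of a few consecutive frames, then sample chunks at random each training step) might look like this hypothetical sketch; `sample_gop_chunk` is an illustrative name:

```python
import random

def sample_gop_chunk(num_frames, gop_size=3, rng=random):
    """Randomly pick one GoP-sized chunk of consecutive frame indices
    from a video of num_frames frames (toy illustration)."""
    # Starting indices of all full, non-overlapping chunks of gop_size frames.
    starts = list(range(0, num_frames - gop_size + 1, gop_size))
    start = rng.choice(starts)
    return list(range(start, start + gop_size))
```

At training time one such chunk would be drawn per step; at test time the sequence is instead processed in fixed GoPs of 12 frames.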
Hardware Specification Yes We report the walltime on machines with 40-core Intel Xeon Gold 6230 CPUs with 256 GB RAM and NVIDIA Tesla V100-SXM2 GPUs with 32 GB VRAM. We only use a single GPU.
Software Dependencies Yes We generate H.265 and H.264 results using version v3.4.8 of ffmpeg (FFmpeg). We are grateful to the authors and maintainers of ffmpeg (FFmpeg), Matplotlib (Hunter, 2007), NumPy (Charles R. Harris et al., 2020), OpenCV (Bradski, 2000), pandas (McKinney, 2010), Python (Python core team, 2019), PyTorch (Paszke et al., 2017), SciPy (SciPy contributors, 2020), and seaborn (Waskom, 2021).
Experiment Setup Yes The scale-space flow models described in Sec. 3 are trained with the MSE training setup described in Agustsson et al. (2020)... We first train for 1 million steps on 256 × 256 crops with a learning rate of 10⁻⁴. We then conduct the MSE finetune stage of the training procedure from Agustsson et al. (2020) (not to be confused with instance-adaptive finetuning) for the SSF18 model, where we train on crops of size h × w = 256 × 384 with a learning rate of 10⁻⁵. The models are trained with a GoP size of 3 frames... On each instance, we finetune the models with the InstA objective in Eq. (2), using the same weight β as used to train the corresponding global model. We finetune for up to two weeks, corresponding to an average of 300,000 steps. In the P-frame scenario we use a GoP size of 3 and finetune on full-resolution frames (1920 × 1080 pixels) with a batch size of 1 and a learning rate of 10⁻⁵. To discretize the updates δ, we use a fixed grid of n equal-sized bins of width t centered around δ = 0 and clip values at the tails. The quantization of z is analogous, except that we use a bin width of t = 1 and do not clip the values at the tails (in line with Ballé et al. (2018)). For the experiments in this paper we use a bin width t = 0.001, σ = 0.05, s = t/6, a spike-slab ratio α = 100, and a number of quantization bins of n = 289.
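The fixed-grid quantization the quote describes (n equal-sized bins of width t centered on δ = 0, clipped at the tails for the updates; bin width 1 and no clipping for the latents) can be sketched as follows. The function names are illustrative, not from the paper:

```python
import numpy as np

def quantize_updates(delta, t=0.001, n=289):
    """Quantize parameter updates on a fixed grid of n bins of width t,
    centered on delta = 0, clipping values at the tails."""
    half = (n - 1) // 2                              # bin indices -half..+half
    idx = np.clip(np.round(delta / t), -half, half).astype(int)
    return idx * t                                    # dequantized values

def quantize_latents(z):
    """Latents use the same scheme with bin width 1 and no clipping."""
    return np.round(z)

updates = np.array([0.00012, -0.0034, 0.5])  # last value lies beyond the grid
quantize_updates(updates)                    # 0.5 is clipped to the outermost bin
```

With t = 0.001 and n = 289, the grid covers updates in roughly [-0.144, 0.144], so any larger finetuning update is saturated at the edge of the grid.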