CAMSIC: Content-aware Masked Image Modeling Transformer for Stereo Image Compression
Authors: Xinjie Zhang, Shenyuan Gao, Zhening Liu, Jiawei Shao, Xingtong Ge, Dailan He, Tongda Xu, Yan Wang, Jun Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets Cityscapes and InStereo2K with fast encoding and decoding speed. |
| Researcher Affiliation | Collaboration | Xinjie Zhang1,2*, Shenyuan Gao1, Zhening Liu1, Jiawei Shao1,3, Xingtong Ge2, Dailan He4, Tongda Xu5, Yan Wang5, Jun Zhang1 1The Hong Kong University of Science and Technology 2SenseTime Research 3Institute of Artificial Intelligence (TeleAI), China Telecom 4The Chinese University of Hong Kong 5Institute for AI Industry Research (AIR), Tsinghua University |
| Pseudocode | No | The paper describes methods and processes in paragraph form and through figures, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The text mentions using external tools and libraries like "CompressAI (Bégaint et al. 2020)" and mentions that they "follow their open source library (Wödlinger et al. 2024) to train the ECSIC models" for benchmarks. However, it does not explicitly state that the authors' own implementation code for CAMSIC is open-sourced or provide a link to it. |
| Open Datasets | Yes | Dataset. We evaluate the coding efficiency of our proposed CAMSIC on two public stereo image datasets: (1) Cityscapes (Cordts et al. 2016): A dataset of urban outdoor scenes with distant views. (2) InStereo2K (Bao et al. 2020): A dataset of indoor scenes with close views. |
| Dataset Splits | Yes | (1) Cityscapes (Cordts et al. 2016): ... It includes 5000 image pairs at a resolution of 2048×1024 pixels. We divide these into 2975 pairs for training, 500 pairs for validation, and the remaining 1525 pairs for testing. (2) InStereo2K (Bao et al. 2020): ... It contains 2060 image pairs at a resolution of 1080×860 pixels. We allocate 2010 and 50 stereo image pairs for training and testing, respectively. |
| Hardware Specification | Yes | Our experiments are conducted with NVIDIA V100 GPUs using PyTorch. |
| Software Dependencies | Yes | We run HM-18.0 and VTM-23.0 software with low-delay P configuration and YUV444 format to evaluate the coding efficiency of HEVC and VVC. |
| Experiment Setup | Yes | Leveraging CompressAI (Bégaint et al. 2020), we train our models with 6 different λ values (256, 512, 1024, 2048, 4096, 8192 for the MSE metric; 8, 16, 32, 64, 128, 256 for the MS-SSIM metric). For MSE-optimized models, they are trained for 400 epochs with the Adam optimizer (Kingma and Ba 2015). The batch size is set as 4. The initial learning rate is 1e-4 and decayed by a factor of 2 every 100 epochs. ... For MS-SSIM evaluation, the MSE-optimized models are fine-tuned for 300 epochs using the MS-SSIM distortion loss with the initial learning rate as 5e-5. During inference, we set the number of decoding steps as 8. |
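The training schedule quoted above (six rate points per metric, step decay of the learning rate by a factor of 2 every 100 epochs) can be sketched as plain Python. This is a minimal illustration under the stated hyperparameters, not the authors' code; the names `lr_at_epoch`, `LAMBDAS_MSE`, and `LAMBDAS_MSSSIM` are hypothetical.

```python
# Rate-distortion trade-off values reported in the quoted setup,
# one model per lambda (these names are illustrative, not from the paper).
LAMBDAS_MSE = [256, 512, 1024, 2048, 4096, 8192]   # MSE-optimized models
LAMBDAS_MSSSIM = [8, 16, 32, 64, 128, 256]         # MS-SSIM fine-tuned models


def lr_at_epoch(epoch: int, initial_lr: float = 1e-4,
                decay_factor: float = 2.0, decay_every: int = 100) -> float:
    """Step-decay schedule: the learning rate is divided by
    `decay_factor` every `decay_every` epochs."""
    return initial_lr / (decay_factor ** (epoch // decay_every))


# Over the 400-epoch MSE training run this gives:
#   epochs 0-99   -> 1e-4
#   epochs 100-199 -> 5e-5
#   epochs 200-299 -> 2.5e-5
#   epochs 300-399 -> 1.25e-5
```

In PyTorch this schedule corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)` wrapped around an Adam optimizer with `lr=1e-4`.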