Image and Video Tokenization with Binary Spherical Quantization
Authors: Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of BSQ-ViT on visual reconstruction and compression benchmarks. On image reconstruction, our model achieves state-of-the-art visual reconstruction quality by both pixel-level and semantic metrics. In particular, our best-performing BSQ-ViT achieves a reconstruction FID of 0.41 on ImageNet-1k val, a 43% reduction compared to the runner-up (SDXL-VAE (Podell et al., 2023)), while being 2.4× faster. On video reconstruction, our best model reduces FVD on UCF-101 by more than half (8.62 → 4.10). By further learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with conventional compression standards, e.g. H.264 and HEVC. By learning a masked language model, BSQ-ViT enables image generation with similar quality to BigGAN (Brock et al., 2018) and ADM (Dhariwal & Nichol, 2021). Code is available at https://github.com/zhaoyue-zephyrus/bsq-vit. |
| Researcher Affiliation | Collaboration | Yue Zhao (UT Austin), Yuanjun Xiong (MThreads AI), Philipp Krähenbühl (UT Austin) |
| Pseudocode | No | The paper describes algorithms (e.g., Binary Spherical Quantization, Transformer-based Visual Tokenizer) but does not present them in clearly labeled pseudocode or algorithm blocks. It explains procedures in narrative text and mathematical formulations. |
| Open Source Code | Yes | Code is available at https://github.com/zhaoyue-zephyrus/bsq-vit. |
| Open Datasets | Yes | ImageNet-1k has 1.28M training images and 50,000 validation images; COCO 2017val has 5,000 images. UCF101 has 13,320 video clips and three train-val splits. Following prior works (Yu et al., 2023), we consider split-1, which has 9,537 clips for training and 3,783 for validation. The MCL-JCV dataset (Wang et al., 2016) consists of thirty 1080P (1,920×1,080) video sequences at 24–30 FPS. The Open Ultra Video Group (UVG) dataset (Mercat et al., 2020) consists of sixteen 4K (3,840×2,160) test video sequences captured at 50/120 FPS. Following prior works (Agustsson et al., 2020), we report the performance on a subset of seven videos in YUV 8-bit format at 120 FPS under the resolution of 1,920×1,080. We compare BSQ with two image compression standards, JPEG2000 and WebP, on the Kodak image dataset in Table 4a. |
| Dataset Splits | Yes | ImageNet-1k has 1.28M training images and 50,000 validation images; COCO 2017val has 5,000 images. UCF101 has 13,320 video clips and three train-val splits. Following prior works (Yu et al., 2023), we consider split-1, which has 9,537 clips for training and 3,783 for validation. |
| Hardware Specification | Yes | The hardware for training is 8-GPU servers with NVIDIA A5000 (24GB). Pre-training an image tokenizer and fine-tuning a video tokenizer in the full schedule is done across two servers with distributed training and takes around 5 days. Training the AR model for AC is done on an 8-GPU server and takes around 1 week. When measuring the tokenizer's throughput and the compression runtime, we use a server with 4 A5000 GPUs and 1 AMD Ryzen Threadripper PRO 5975WX 32-Core CPU (64 threads). |
| Software Dependencies | No | The paper mentions using FFmpeg and the AdamW optimizer, but does not provide specific version numbers for these or any other software libraries or programming languages used in the implementation. |
| Experiment Setup | Yes | We train the image tokenizer with a batch size of 32 per GPU. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with (β1, β2) = (0.9, 0.99) and 1×10⁻⁴ weight decay. The base learning rate is 4×10⁻⁷ (or a total learning rate of 1×10⁻⁴) and follows a half-period cosine annealing schedule. The model is trained for 1M steps, which amounts to 200 epochs over the entire ImageNet-1k training set. We did not heavily study the effect of loss weights. Instead, we keep γ = 1 in the entropy terms. We use a perceptual loss weight of 0.1 and an adversarial loss weight of 0.1 throughout the experiments. We finetune the video tokenizer with a batch size of 32 per GPU. The optimization schedule follows the image-based one but trains for fewer iterations. The network is initialized from the ImageNet-pretraining checkpoint and undergoes another 500K steps, which amounts to 1,600 epochs over UCF-101 split-1 train. The masked LM is a standard post-LN Transformer with 24 layers and a hidden dimension of 768, following MaskGIT (Chang et al., 2022). We train the masked LM on 2 nodes of 8 GPUs (16 in total) with a total batch size of 1024 for 1M steps. We use the AdamW optimizer with (β1, β2) = (0.9, 0.96) and 0.045 weight decay. At inference time, we use the cosine unmasking schedule in MaskGIT (Chang et al., 2022) and set the sampling temperature to 15. We use classifier-free guidance (Ho & Salimans, 2022): at training, we replace 20% of the class condition labels with the mask token so that the model learns an unconditional distribution simultaneously. Let ℓc be class-conditioned logits and ℓ∅ be unconditional logits. During inference, we interpolate logits using ℓ′ = ℓc + α(ℓc − ℓ∅), where α = 0.5. The auto-regressive model is a Transformer with 24 layers and a hidden dimension of 768. We train this model on 8 GPUs with a total batch size of 64. We use the AdamW optimizer with (β1, β2) = (0.9, 0.96) and 0.045 weight decay. |
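The classifier-free guidance step quoted in the Experiment Setup row reduces to a one-line logit interpolation. A minimal sketch follows; the function name and the toy logit values are illustrative assumptions, not taken from the paper's released code:

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, alpha=0.5):
    """Classifier-free guidance interpolation:
    l' = l_c + alpha * (l_c - l_uncond), per the paper's formula."""
    return logits_cond + alpha * (logits_cond - logits_uncond)

# Toy logits over a 4-token vocabulary (hypothetical values).
lc = np.array([2.0, 0.5, -1.0, 0.0])   # class-conditioned logits
lu = np.array([1.0, 1.0, -0.5, 0.0])   # unconditional logits
guided = cfg_logits(lc, lu, alpha=0.5)  # -> [2.5, 0.25, -1.25, 0.0]
```

With α = 0 the guided logits equal the conditional logits; larger α pushes sampling further toward tokens favored by the class condition.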