From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Authors: Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop Being-VL-0, a model that demonstrates superior performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
Researcher Affiliation | Collaboration | Wanpeng Zhang1, Zilong Xie2, Yicheng Feng1, Yijiang Li3, Xingrun Xing4,5, Sipeng Zheng4, Zongqing Lu1,6. 1School of Computer Science, Peking University; 2The Chinese University of Hong Kong; 3University of California, San Diego; 4Beijing Academy of Artificial Intelligence; 5Institute of Automation, Chinese Academy of Sciences; 6Being Beyond
Pseudocode | Yes | Algorithm 4.1: BPE Image Tokenizer training procedure. ... The main procedure of this algorithm is presented in Algorithm 4.1, with supporting functions detailed in Algorithms B.1 and B.2 (see Appendix B).
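To make the pseudocode row concrete: the core of a BPE-style tokenizer trainer repeatedly merges the most frequent adjacent token pair into a new vocabulary entry. The sketch below is a minimal, hypothetical simplification over 1D sequences of quantized codebook indices; the paper's Algorithm 4.1 operates on 2D VQ-GAN token grids and is more involved, and all function names here are illustrative, not the authors' API.

```python
# Minimal BPE-merge sketch over quantized visual token sequences.
# Hypothetical simplification of the paper's Algorithm 4.1 (which works
# on 2D token grids); names and signatures are illustrative only.
from collections import Counter

def most_frequent_pair(sequences):
    """Count all adjacent token pairs and return the most frequent one."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(sequences, base_vocab_size, num_merges):
    """Learn `num_merges` merge rules; new tokens start at base_vocab_size."""
    merges, next_token = [], base_vocab_size
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append((pair, next_token))
        sequences = [merge_pair(s, pair, next_token) for s in sequences]
        next_token += 1
    return merges, sequences
```

For example, with a base codebook of 16 entries, one merge over `[[1, 2, 1, 2, 3], [1, 2, 4]]` learns the rule `(1, 2) -> 16` and compresses the sequences to `[[16, 16, 3], [16, 4]]`.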
Open Source Code | No | The text is ambiguous and lacks a clear, affirmative statement of release. The abstract mentions 'For further details, visit our website' but does not specify code availability or provide a direct link to a code repository.
Open Datasets | Yes | We constructed a diverse image dataset comprising 2.78 million images from ImageNet (Deng et al., 2009), CC (Sharma et al., 2018), LAION (Schuhmann et al., 2022), and SBU (Ordonez et al., 2011). We evaluated our model using multiple benchmarks: VQAv2 (Goyal et al., 2017), MMBench (Liu et al., 2023), MME (Fu et al., 2023), POPE (Li et al., 2023), VizWiz (Gurari et al., 2018). This stage used 595K images from CC-3M (Sharma et al., 2018) and 558K from LCS (Liu et al., 2024b). RefCOCO (50.6K) (Kazemzadeh et al., 2014), AOKVQA (66.2K) (Schwenk et al., 2022), ShareGPT4o (57.3K) (Chen et al., 2023), and ALLaVA Inst (70K) (Chen et al., 2024).
Dataset Splits | Yes | The training process consisted of two stages. Stage 1: Image Understanding Pretraining (PT). This stage used 595K images from CC-3M (Sharma et al., 2018) and 558K from LCS (Liu et al., 2024b). Stage 2: Supervised Fine-Tuning (SFT). This stage used 1.27 million entries from the LLaVA-OneVision Dataset (Li et al., 2024), including General QA (504K), Doc & Chart (249K), Reasoning (343K), and OCR (180K) tasks. For NoCaps, we chose the validation split. For Flickr30k, we selected a 1k-image split.
Hardware Specification | Yes | We list the hardware resources used in Table G.1. CPU: Intel 3GHz; GPU: Nvidia A800 (80GB) × 8; RAM: 1024GB
Software Dependencies | No | In our code, we have used the following libraries, which are covered by the corresponding licenses: NumPy (BSD-3-Clause license), PyTorch (BSD-3-Clause license), Transformers (Apache license), Numba (BSD-2-Clause license). No specific version numbers are provided for these libraries.
Experiment Setup | Yes | Table B.1: Hyperparameters for Transformer models ... Table B.2: Hyperparameters for the VQ-GAN model ... Table B.3: Hyperparameters for training MLLM ... These tables specify parameters such as batch size, learning rate, weight decay, optimizer, number of heads, layers, embedding dimension, codebook size, z channels, resolution, dropout, gradient accumulation, learning rate schedule, warmup ratio, epochs, and DeepSpeed stage.