From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Authors: Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop Being-VL-0, a model that demonstrates superior performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models. |
| Researcher Affiliation | Collaboration | Wanpeng Zhang1, Zilong Xie2, Yicheng Feng1, Yijiang Li3, Xingrun Xing4,5, Sipeng Zheng4, Zongqing Lu1,6. 1School of Computer Science, Peking University; 2The Chinese University of Hong Kong; 3University of California, San Diego; 4Beijing Academy of Artificial Intelligence; 5Institute of Automation, Chinese Academy of Sciences; 6Being Beyond |
| Pseudocode | Yes | Algorithm 4.1 BPE Image Tokenizer training procedure. ... The main procedure of this algorithm is presented in Algorithm 4.1, with supporting functions detailed in Algorithms B.1 and B.2 (see Appendix B). |
| Open Source Code | No | The text is ambiguous or lacks a clear, affirmative statement of release. The abstract mentions 'For further details, visit our website' but does not specify code availability or provide a direct link to a code repository. |
| Open Datasets | Yes | We constructed a diverse image dataset comprising 2.78 million images from ImageNet (Deng et al., 2009), CC (Sharma et al., 2018), LAION (Schuhmann et al., 2022), and SBU (Ordonez et al., 2011). We evaluated our model using multiple benchmarks: VQAv2 (Goyal et al., 2017), MMBench (Liu et al., 2023), MME (Fu et al., 2023), POPE (Li et al., 2023), VizWiz (Gurari et al., 2018). This stage used 595K images from CC-3M (Sharma et al., 2018) and 558K from LCS (Liu et al., 2024b). RefCOCO (50.6K) (Kazemzadeh et al., 2014), AOKVQA (66.2K) (Schwenk et al., 2022), ShareGPT4o (57.3K) (Chen et al., 2023), and ALLaVA Inst (70K) (Chen et al., 2024). |
| Dataset Splits | Yes | The training process consisted of two stages: Stage 1: Image Understanding Pretraining (PT). This stage used 595K images from CC-3M (Sharma et al., 2018) and 558K from LCS (Liu et al., 2024b). Stage 2: Supervised Fine-Tuning (SFT). This stage used 1.27 million entries from the LLaVA-OneVision Dataset (Li et al., 2024), including General QA (504K), Doc & Chart (249K), Reasoning (343K), and OCR (180K) tasks. For NoCaps, we chose the validation split. For Flickr30k, we selected a 1k-image split. |
| Hardware Specification | Yes | We list the hardware resources used in Table G.1. CPU: Intel 3GHz, GPU: Nvidia A800 (80GB) × 8, RAM: 1024GB |
| Software Dependencies | No | In our code, we have used the following libraries which are covered by the corresponding licenses: NumPy (BSD-3-Clause license), PyTorch (BSD-3-Clause license), Transformers (Apache license), Numba (BSD-2-Clause license). No specific version numbers are provided for these libraries. |
| Experiment Setup | Yes | Table B.1: Hyperparameters for Transformer models ... Table B.2: Hyperparameters for the VQ-GAN model ... Table B.3: Hyperparameters for training MLLM ... These tables specify parameters such as batch size, learning rate, weight decay, optimizer, number of heads, layers, embedding dimension, codebook size, z channels, resolution, dropout, gradient accumulation, learning schedule, warmup ratio, epoch, and deepspeed stage. |
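The Pseudocode row above references Algorithm 4.1, a BPE training procedure applied to quantized visual tokens (e.g. VQ-GAN codebook indices). As a rough illustration of that general idea only (not the paper's released code; the function name, the flattening of images to 1-D ID sequences, and the scheme of allocating new IDs above the base codebook range are all assumptions), a minimal BPE merge loop might look like:

```python
from collections import Counter

def train_bpe_image_tokenizer(sequences, num_merges):
    """Learn BPE merges over sequences of quantized visual token IDs.

    `sequences`: list of lists of ints (e.g. flattened VQ codebook indices).
    Returns a dict mapping each merged pair (id_a, id_b) to its new token ID.
    Illustrative sketch only, not the paper's exact Algorithm 4.1.
    """
    next_id = max(t for seq in sequences for t in seq) + 1
    merges = {}
    for _ in range(num_merges):
        # Count all adjacent token-ID pairs across the corpus.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # stop when no pair repeats
            break
        merges[best] = next_id
        # Replace every occurrence of the best pair with the new ID.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
        next_id += 1
    return merges
```

The key difference from text BPE is only the alphabet: the base vocabulary is the VQ codebook rather than bytes, so merged tokens come to represent frequently co-occurring visual patterns.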