Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers

Authors: Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Experimentally, we demonstrate a reduction in energy consumed of up to 47%. Comparative approaches of 8-bit quantization and token merging can lead to significantly increased energy costs (up to 500% or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance. The experimental code for our paper is also made publicly available.
Researcher Affiliation: Academia. Leonidas Gee (EMAIL), Predictive Analytics Lab, University of Sussex, UK; Wing Yan Li (EMAIL), University of Surrey, UK; Viktoriia Sharmanska (EMAIL), Predictive Analytics Lab, University of Sussex, UK; Novi Quadrianto (EMAIL), Predictive Analytics Lab, University of Sussex, UK; Basque Center for Applied Mathematics, Spain; Monash University, Indonesia.
Pseudocode: No. The paper describes the methods (intra-image and inter-image approaches) verbally and illustrates them with a high-level flowchart in Figure 2. However, it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
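Since the paper gives only a verbal description, a minimal sketch of what the intra-image top-k selection might look like is shown below. The function name and the L2-norm saliency score are assumptions for illustration, not the authors' code; the paper only states that the top 50% of patches are kept.

```python
import numpy as np

def intra_image_topk(patch_embeddings: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the top-k patches of one image, scored here by L2 norm (assumed score)."""
    n_patches = patch_embeddings.shape[0]
    k = max(1, int(n_patches * keep_ratio))
    scores = np.linalg.norm(patch_embeddings, axis=1)
    keep = np.argsort(scores)[::-1][:k]     # indices of the k highest-scoring patches
    return patch_embeddings[np.sort(keep)]  # preserve original patch order

patches = np.random.rand(196, 768)          # 14x14 ViT patch grid, 768-dim embeddings
reduced = intra_image_topk(patches, keep_ratio=0.5)
print(reduced.shape)                        # (98, 768)
```

With keep_ratio=0.5 this matches the paper's T^0.5_intra setting of retaining half the patches per image.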
Open Source Code: Yes. The experimental code for our paper is also made publicly available (https://github.com/wearepal/visual-word-tokenizer).
Open Datasets: Yes. We conduct our analysis through the lens of (i) classification performance of visual recognition, (ii) subgroup robustness, and (iii) generative performance of visual captioning. For (i) and (ii), we utilize three publicly available datasets (Waterbirds (Wah et al., 2011), CelebA (Liu et al., 2015), MetaShift (Liang & Zou, 2022)) that are typical benchmarks in robustness and fairness research (Sagawa et al., 2019; Liu et al., 2021; Yang et al., 2023). ... For (i), we also conduct a large-scale evaluation on the Open Images v6 dataset (Kuznetsova et al., 2020). ... For (iii), we utilize the Karpathy test split of the COCO dataset (Lin et al., 2014) and a validation set of the NoCaps dataset (Agrawal et al., 2019)...
Dataset Splits: Yes. Further details on the defined subgroups are provided in Appendix A.1. Table 8 (defined subgroups in Waterbirds, CelebA, and MetaShift) reports per-subgroup counts for Waterbirds: Training (3498, 184, 56, 1057); Validation (467, 466, 133, 133); Test (2255, 2255, 642, 642).
Hardware Specification: Yes. Lastly, our experiments are conducted using a single NVIDIA A100 GPU.
Software Dependencies: No. The paper mentions loading pre-trained CLIP and BLIP models from Hugging Face and using `sklearn.cluster.MiniBatchKMeans`, but it does not pin version numbers for these libraries, Python, or any other critical software components.
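The `MiniBatchKMeans` usage the paper mentions presumably builds the fixed visual-word vocabulary by clustering pooled patch embeddings. A hedged sketch under that assumption (vocabulary size, data, and variable names are illustrative; the authors' actual pipeline may differ):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for patch embeddings pooled across many training images.
patch_bank = rng.standard_normal((10_000, 768)).astype(np.float32)

# Cluster the pooled patches into a fixed vocabulary of "visual words".
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=1024, random_state=0)
kmeans.fit(patch_bank)
vocabulary = kmeans.cluster_centers_   # (100, 768) visual-word vectors

# At inference, new patches are assigned to their nearest visual word.
assignments = kmeans.predict(patch_bank[:5])
print(vocabulary.shape)
```

Because no library versions are given in the paper, results of re-running such clustering may vary slightly across scikit-learn releases.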
Experiment Setup: Yes. For image classification, we load the pre-trained CLIP (Radford et al., 2021) model from Hugging Face. An image size of 224×224 is used with bilinear interpolation for CLIP. For image captioning, we load the pre-trained BLIP (Li et al., 2022) model from Hugging Face. To perform zero-shot captioning, we use a beam size of 3 along with maximum and minimum lengths of 20 and 5, respectively. An image size of 384×384 is used with bicubic interpolation. ... For the VWTs, we set the top-k of the intra-image approach to 50% of the total number of patches, which we denote as T^0.5_intra, as we found it to work best. ... For the inter-image approach, we set the threshold to 0.1 unless stated otherwise ... Lastly, our experiments are conducted using a single NVIDIA A100 GPU. Since our focus is on the online setting (real-time), we set the batch size to 1 unless stated otherwise.
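The inter-image threshold of 0.1 is not fully specified in the quoted setup. One plausible reading, sketched below purely as an assumption, is that a patch whose cosine distance to its nearest visual word falls within the threshold is replaced by (merged into) that word; the function name, distance metric, and merging rule are all hypothetical, not the authors' definition.

```python
import numpy as np

def inter_image_compress(patches, vocabulary, threshold=0.1):
    """Replace patches within `threshold` cosine distance of a visual word
    by that word; leave the rest unchanged (merging rule is assumed)."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    v = vocabulary / np.linalg.norm(vocabulary, axis=1, keepdims=True)
    sim = p @ v.T                                 # cosine similarity, (n_patches, n_words)
    nearest = sim.argmax(axis=1)                  # closest visual word per patch
    matched = (1.0 - sim.max(axis=1)) <= threshold
    out = patches.copy()
    out[matched] = vocabulary[nearest[matched]]
    return out, matched

rng = np.random.default_rng(0)
vocab = rng.standard_normal((16, 8))
# First patch is a scaled copy of word 3, so its cosine distance to it is 0.
patches = np.vstack([vocab[3] * 2.0, rng.standard_normal((4, 8))])
out, matched = inter_image_compress(patches, vocab, threshold=0.1)
print(matched[0])   # True
```

A batch size of 1, as used in the paper's online setting, means such per-image compression directly determines the sequence length of each forward pass.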