Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Authors: Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we demonstrate a reduction in energy consumed of up to 47%. Comparative approaches of 8-bit quantization and token merging can lead to significantly increased energy costs (up to 500% or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance. The experimental code for our paper is also made publicly available. |
| Researcher Affiliation | Academia | Leonidas Gee (Predictive Analytics Lab, University of Sussex, UK); Wing Yan Li (University of Surrey, UK); Viktoriia Sharmanska (Predictive Analytics Lab, University of Sussex, UK); Novi Quadrianto (Predictive Analytics Lab, University of Sussex, UK; Basque Center for Applied Mathematics, Spain; Monash University, Indonesia) |
| Pseudocode | No | The paper describes the methods (intra-image and inter-image approaches) verbally and illustrates them with a high-level flowchart in Figure 2. However, it does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The experimental code for our paper is also made publicly available at https://github.com/wearepal/visual-word-tokenizer |
| Open Datasets | Yes | We conduct our analysis through the lens of (i) classification performance of visual recognition, (ii) subgroup robustness, and (iii) generative performance of visual captioning. For (i) and (ii), we utilize three publicly available datasets (Waterbirds (Wah et al., 2011), CelebA (Liu et al., 2015), MetaShift (Liang & Zou, 2022)) that are typical benchmarks in robustness and fairness research (Sagawa et al., 2019; Liu et al., 2021; Yang et al., 2023). ... For (i), we also conduct a large-scale evaluation on the Open Images v6 dataset (Kuznetsova et al., 2020). ... For (iii), we utilize the Karpathy test split of the COCO dataset (Lin et al., 2014) and a validation set of the NoCaps dataset (Agrawal et al., 2019)... |
| Dataset Splits | Yes | Further details on the defined subgroups are provided in Appendix A.1. Table 8: Defined subgroups in Waterbirds, CelebA, and MetaShift. Waterbirds (counts for the four subgroups): Training (3498, 184, 56, 1057), Validation (467, 466, 133, 133), Test (2255, 2255, 642, 642). |
| Hardware Specification | Yes | Lastly, our experiments are conducted using a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions loading pre-trained models from Hugging Face for CLIP and BLIP, and using `sklearn.cluster.MiniBatchKMeans`. However, it does not specify version numbers for these libraries, Python, or any other critical software components. |
| Experiment Setup | Yes | For image classification, we load the pre-trained CLIP (Radford et al., 2021) model from Hugging Face. An image size of 224 × 224 is used with bilinear interpolation for CLIP. For image captioning, we load the pre-trained BLIP (Li et al., 2022) model from Hugging Face. To perform zero-shot captioning, we use a beam size of 3 along with maximum and minimum lengths of 20 and 5, respectively. An image size of 384 × 384 is used with bicubic interpolation. ... For the VWTs, we set the top-k of the intra-image approach to 50% of the total number of patches, which we denote as T_intra^0.5, as we found it to work best. ... For the inter-image approach, we set the threshold to 0.1 unless stated otherwise ... Lastly, our experiments are conducted using a single NVIDIA A100 GPU. Since our focus is on the online setting (real-time), we set the batch size to 1 unless stated otherwise. |
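The software-dependencies row notes that the paper uses `sklearn.cluster.MiniBatchKMeans`, which fits the inter-image approach of clustering patch embeddings into a shared vocabulary of visual words. A minimal sketch under stated assumptions: the synthetic embeddings, their shapes, and the vocabulary size of 100 are illustrative, not values from the paper.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical ViT patch embeddings (2000 patches, 64-dim); shapes are
# illustrative assumptions, not the paper's actual feature dimensions.
patch_embeddings = rng.normal(size=(2000, 64)).astype(np.float32)

# Build a fixed vocabulary of "visual words" by clustering patch embeddings.
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=256, n_init=3, random_state=0)
kmeans.fit(patch_embeddings)

visual_words = kmeans.cluster_centers_               # (100, 64) vocabulary
assignments = kmeans.predict(patch_embeddings[:4])   # map new patches to words
```

MiniBatchKMeans processes the data in mini-batches, so a vocabulary can be fit over a large corpus of patches without holding all centroid updates in memory at once.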
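The setup row fixes the intra-image top-k at 50% of the total patches (denoted T_intra^0.5). The arithmetic of that setting can be illustrated as follows; note the selection criterion below (patch-embedding L2 norm) is a stand-in assumption for illustration only, as the paper defines its own intra-image grouping.

```python
import numpy as np

rng = np.random.default_rng(0)
# A 224 x 224 image with 16 x 16 patches yields 14 * 14 = 196 patch embeddings.
patches = rng.normal(size=(196, 768))

k = patches.shape[0] // 2                  # top-k = 50% of the patches -> 98
scores = np.linalg.norm(patches, axis=1)   # stand-in saliency score (assumption)
keep = np.argsort(scores)[-k:]             # indices of the retained patches
kept_patches = patches[keep]               # (98, 768) token sequence after pruning
```

Halving the token count roughly halves the per-image attention cost, which is consistent with the paper's focus on online (batch size 1) efficiency.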