ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition

Authors: Seungdong Yoa, Seungjun Lee, Hye-Seung Cho, Bumsoo Kim, Woohyung Lim

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct extensive experiments to demonstrate our retokenization strategy for efficient ViTs. We first show that our method outperforms the baselines by a substantial margin. Analytical experiments are conducted to further explain the factors contributing to our method's effectiveness. Notably, the hyper-speed inference experiment in Fig. 3 reveals that our method achieves relatively robust performance even with a drastically reduced number of tokens.
Researcher Affiliation | Collaboration | 1LG AI Research, 2Chung-Ang University, EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology in natural language text and provides an architectural diagram, but it does not include explicitly structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository.
Open Datasets | Yes | We conduct all the experiments on ImageNet-1k (Deng et al. 2009) consisting of 1.2 million images in the training set and 50k images in the test set.
Dataset Splits | Yes | We conduct all the experiments on ImageNet-1k (Deng et al. 2009) consisting of 1.2 million images in the training set and 50k images in the test set.
Hardware Specification | Yes | The throughput (img/s) is measured on a single NVIDIA GeForce RTX 3090 during inference.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages or libraries.
Experiment Setup | Yes | Implementation details. We conduct all the experiments on ImageNet-1k (Deng et al. 2009) consisting of 1.2 million images in the training set and 50k images in the test set. The image resolution is 224×224 in training and testing. During training our models, we simply follow all the training strategies and optimization methods used in DeiT (Touvron et al. 2021a). We train our model from scratch for 300 epochs, and we don't use any tricks (e.g., adding extra parameters, starting from an existing checkpoint or fine-tuning, using additional training tricks), unlike other prior works. The throughput (img/s) is measured on a single NVIDIA GeForce RTX 3090 during inference. For our method to apply the local coherence bias module, we adopt simple convolutional layers (four 3×3 convolutions and a single 1×1 convolution), replacing a standard ViT's patchify stem. In all experiments, we set the proportion p of the non-semantic token set to 0.3, targeting the bottom 30% of tokens based on their importance (attentiveness). Therefore, these tokens are candidates for retokenization. We set the similarity merging ratio to 0.08, meaning that token pairs equivalent to 0.08× the total number of tokens from the non-semantic token set are selected for merging based on their similarity. Also, we set the pruning ratio r to 0.8, which results in discarding the bottom 20% of tokens, identified as non-semantic, after retokenization by ImagePiece. These hyperparameters are specifically adjusted to further accelerate ViTs beyond the standard settings.
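The token-budget arithmetic implied by the quoted hyperparameters (p = 0.3, merge ratio 0.08, pruning ratio r = 0.8) can be sketched as follows. This is a hypothetical helper for sanity-checking the counts, not the authors' implementation; the function name and the reading of each ratio (merge ratio applied to the total token count, r as the fraction of tokens kept) are assumptions based on the excerpt above.

```python
def token_budget(num_tokens, p=0.3, merge_ratio=0.08, r=0.8):
    """Approximate token counts at each retokenization stage.

    Returns (non-semantic candidates, merged pairs, tokens kept after pruning).
    """
    # Bottom 30% of tokens by attentiveness are retokenization candidates.
    non_semantic = int(num_tokens * p)
    # Pairs equal to 0.08x the total token count are merged by similarity;
    # each merged pair collapses two tokens into one.
    merged_pairs = int(num_tokens * merge_ratio)
    after_merge = num_tokens - merged_pairs
    # Pruning ratio r = 0.8 keeps the top 80%, discarding the bottom 20%.
    kept = int(after_merge * r)
    return non_semantic, merged_pairs, kept

# A 224x224 image with 16x16 patches yields 14 * 14 = 196 tokens.
print(token_budget(196))
```

Under these assumptions, roughly a quarter of the original 196 tokens are removed per retokenization stage (15 by merging, a further 20% of the remainder by pruning), which is consistent with the paper's stated goal of accelerating ViTs beyond standard settings.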