Subobject-level Image Tokenization

Authors: Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
Researcher Affiliation Collaboration 1) Meta FAIR Paris, 2) The Hong Kong University of Science and Technology, 3) Alibaba Group, 4) Zillow.
Pseudocode No The paper describes the EPOC method in Section 3.2.3 and visually in Figure 2, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes Project website: https://github.com/ChenDelong1999/subobjects.
Open Datasets Yes Intrinsic evaluations across 5 datasets... with COCO's COCONut relabeled validation split (Deng et al., 2024) and ADE-20K (Zhou et al., 2019) validation split providing object-level annotations, while Pascal Panoptic Parts (PPP) (de Geus et al., 2021), PartImageNet++ (PIN++) (Li et al., 2024) and SA-1B (Kirillov et al., 2023) consist of subobject-level annotations. ...ImageNet-1K (Deng et al., 2009)... ShareGPT4V (Chen et al., 2024)... Pixmo-cap (Deitke et al., 2024)... CLEVR-cap generated from CLEVR (Johnson et al., 2017)...
Dataset Splits Yes For COCO, PPP, PIN++, and SA-1B, we randomly sample 3k images for efficient evaluation. For ADE-20K, we include all 2k samples in the validation set. ...ImageNet-1k and CLEVR provide official validation splits; we use 5k samples from them for efficiency. For Pixmo-cap, we randomly sample 5k samples as validation, and for ShareGPT4V, we treat 5k samples randomly selected from the GPT-4V generated captions as the validation split.
Hardware Specification Yes We measure throughput with a V100 (32GB) and 30 CPU cores... The training was performed on a single machine with 8 NVIDIA A100 GPUs.
Software Dependencies No The paper mentions using a SegFormer-b0 model and the scikit-image library, but it does not specify version numbers for these or other key software components like Python, PyTorch, or CUDA.
Experiment Setup Yes We use a two-layer MLP as the connector between embeddings and LLM. The width is 4x the LLM's hidden state dimension. We freeze the image feature extractor and fine-tune the small MLP projection plus the LLM end-to-end. For the CLEVR-cap, ImageNet-1k, ShareGPT4V, and Pixmo-cap datasets, we respectively train the model for 30, 1, 1, and 3 epochs, with batch sizes of 512, 256, 256, and 256. Max tokens are set to 100 for EPOC and 64 for the Mask2Former tokenizer. We use AdamW with learning rate 1e-4, cosine decay or constant scheduling, and 500 warmup steps. Mixed-precision (bf16) is used to accelerate training.
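The reported setup (two-layer MLP connector at 4x the LLM hidden width, frozen feature extractor, AdamW at 1e-4 with 500 warmup steps and cosine decay, bf16 autocast) can be sketched as below. All dimensions, the GELU activation, and the total step count are illustrative assumptions, not values from the paper.

```python
import math
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration (not specified in the report).
vision_dim = 768    # assumed width of the frozen image-token embeddings
llm_hidden = 2048   # assumed LLM hidden state dimension

# Two-layer MLP connector; hidden width is 4x the LLM hidden size, as
# reported. The GELU activation in between is an assumption.
connector = nn.Sequential(
    nn.Linear(vision_dim, 4 * llm_hidden),
    nn.GELU(),
    nn.Linear(4 * llm_hidden, llm_hidden),
)

# The image feature extractor is frozen; only the connector (and, in the
# paper, the LLM) receives gradient updates.
optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)

# 500 linear warmup steps followed by cosine decay -- one common reading
# of the reported schedule. total_steps is an assumed placeholder.
warmup_steps, total_steps = 500, 10_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Forward pass under bf16 autocast, mirroring the mixed-precision setup.
# Batch of 2 images, each with up to 100 visual tokens (the EPOC cap).
tokens = torch.randn(2, 100, vision_dim)
with torch.autocast("cpu", dtype=torch.bfloat16):
    projected = connector(tokens)
print(tuple(projected.shape))
```

The projected tokens would then be concatenated with the text embeddings before the LLM; that wiring is model-specific and omitted here.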