Subobject-level Image Tokenization
Authors: Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens. |
| Researcher Affiliation | Collaboration | 1. Meta FAIR Paris; 2. The Hong Kong University of Science and Technology; 3. Alibaba Group; 4. Zillow. |
| Pseudocode | No | The paper describes the EPOC method in Section 3.2.3 and visually in Figure 2, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project website: https://github.com/ChenDelong1999/subobjects. |
| Open Datasets | Yes | Intrinsic evaluations across 5 datasets... with COCO's COCONut relabeled validation split (Deng et al., 2024) and ADE-20K (Zhou et al., 2019) validation split provide object-level annotations, and Pascal Panoptic Parts (PPP) (de Geus et al., 2021), PartImageNet++ (PIN++) (Li et al., 2024) and SA-1B (Kirillov et al., 2023) consist of subobject-level annotations. ...ImageNet-1K (Deng et al., 2009)... ShareGPT4V (Chen et al., 2024)... Pixmo-cap (Deitke et al., 2024)... CLEVR-cap generated from CLEVR (Johnson et al., 2017)... |
| Dataset Splits | Yes | For COCO, PPP, PIN++, and SA-1B, we randomly sample 3k images for efficient evaluation. For ADE-20K, we include all 2k samples in the validation set. ...ImageNet-1k and CLEVR provide official validation splits, we use 5k samples from them for efficiency. For Pixmo-cap, we randomly sample 5k samples as validation, and for ShareGPT4V, we treat 5k samples randomly selected from the GPT-4V generated captions as validation split. |
| Hardware Specification | Yes | We measure throughput with a V100 (32GB) and 30 CPU cores... The training was performed on a single 8×A100 NVIDIA machine. |
| Software Dependencies | No | The paper mentions using a SegFormer-b0 model and the scikit-image library, but it does not specify version numbers for these or other key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use a two-layer MLP as the connector between embeddings and LLM. The width is 4× the LLM's hidden state dimension. We freeze the image feature extractor and fine-tune the small MLP projection plus the LLM end-to-end. For the CLEVR-cap, ImageNet-1k, ShareGPT4V, and Pixmo-cap datasets, we respectively train the model for 30, 1, 1, and 3 epochs, with batch sizes of 512, 256, 256, and 256. Max tokens are set to 100 for EPOC and 64 for the Mask2Former tokenizer. We use AdamW with a learning rate of 1×10⁻⁴, cosine decay or constant scheduling, and 500 warmup steps. Mixed-precision (bf16) is used to accelerate training. |
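The optimizer schedule quoted in the Experiment Setup row (base learning rate 1×10⁻⁴, 500 linear warmup steps, then cosine decay) can be sketched as follows. This is a minimal illustration, not the paper's code; the decay floor of zero and the linear warmup shape are assumptions, since the paper only names "cosine decay" and "500 warmup steps".

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=500):
    """Warmup + cosine-decay learning-rate schedule (sketch).

    Linearly ramps from 0 to `base_lr` over `warmup_steps`, then
    follows a cosine curve from `base_lr` down to 0 over the
    remaining steps. Decay-to-zero is an assumption.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Peak LR is reached exactly at the end of warmup, halfway through
# decay the LR is half the peak, and it reaches ~0 at the final step.
print(lr_at_step(500, 10_000))   # peak: 1e-4
print(lr_at_step(5_250, 10_000)) # midpoint of decay: 5e-5
```

The constant-scheduling variant mentioned in the quote would simply return `base_lr` after warmup instead of applying the cosine term.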