TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods." (Sec. 4, Experiments: "We show the strong region classification capabilities of TextRegion in Sec. 4.1") |
| Researcher Affiliation | Academia | Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes its methods using mathematical formulas and descriptive text in sections such as "3.2 TextRegion Approach" and "A.3 CLIP Variants", but contains no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/avaxiao/TextRegion |
| Open Datasets | Yes | "We evaluate on six widely used semantic segmentation benchmarks: PASCAL VOC 2012 (Everingham et al., 2015), PASCAL Context (Mottaghi et al., 2014), COCO-Stuff (Caesar et al., 2018), COCO-Object (Lin et al., 2014), Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2019)." |
| Dataset Splits | No | The paper evaluates on well-known benchmarks (e.g., PASCAL VOC 2012, COCO, RefCOCO) for which standard splits are typically used, but it does not explicitly state split percentages, sample counts, or direct citations for the splits in the main text. |
| Hardware Specification | Yes | "This work used NVIDIA GPUs at NCSA Delta through allocation CIS240059 and CIS250059 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program... measured on one A100 GPU." |
| Software Dependencies | No | The paper mentions using SAM2 (Ravi et al., 2024) with a Hiera-Large backbone and specific configurations, but it does not give version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are crucial for reproducibility. |
| Experiment Setup | Yes | "For all experiments, we filter global patches using a threshold of τ = 0.07. The crop size is uniformly set to 336 for all CLIP models (ViT-B/16 through ViT-H/14), while SigLIP2 and Perception Encoder use their respective default input resolutions. Region masks are generated with SAM2 (Ravi et al., 2024) Hiera-Large, using the following configuration: pred-iou-thresh set to 0.6, stability-score-thresh to 0.6, box-nms-thresh to 0.9, and points-per-side to 16. In the semantic segmentation experiments on the Cityscapes dataset, we increase points-per-side to 36 due to its high resolution and the abundance of small objects. To mitigate the impact of duplicated or overlapping masks, we also merge masks with an overlap IoU greater than 0.8." |
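The last step quoted above (merging near-duplicate SAM2 masks whose pairwise overlap IoU exceeds 0.8) can be sketched as follows. This is an illustrative greedy implementation, not the authors' released code; the function names and the set-of-pixels mask representation are assumptions for clarity (real pipelines would typically use boolean arrays).

```python
def mask_iou(a, b):
    """IoU between two binary masks represented as sets of pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def merge_duplicate_masks(masks, iou_thresh=0.8):
    """Greedily fold each mask into the first kept mask it overlaps with
    IoU > iou_thresh (taking the union), otherwise keep it as a new mask.
    Mirrors the paper's stated rule of merging masks with overlap IoU > 0.8."""
    merged = []
    for m in masks:
        for i, kept in enumerate(merged):
            if mask_iou(m, kept) > iou_thresh:
                merged[i] = kept | m  # union the near-duplicate masks
                break
        else:
            merged.append(set(m))
    return merged
```

For example, a 9-pixel mask and the same mask with one extra pixel have IoU 0.9 and collapse into a single region, while a disjoint mask is kept separate.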