TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. 4 Experiments We show the strong region classification capabilities of TextRegion in Sec. 4.1 |
| Researcher Affiliation | Academia | Siebel School of Computing and Data Science University of Illinois at Urbana-Champaign EMAIL |
| Pseudocode | No | The paper describes the methods using mathematical formulas and descriptive text in sections like '3.2 Text Region Approach' and 'A.3 CLIP Variants', but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/avaxiao/TextRegion. |
| Open Datasets | Yes | We evaluate on six widely used semantic segmentation benchmarks: PASCAL VOC 2012 (Everingham et al., 2015), PASCAL Context (Mottaghi et al., 2014), COCO-Stuff (Caesar et al., 2018), COCO-Object (Lin et al., 2014), Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2019). |
| Dataset Splits | No | The paper evaluates on well-known benchmarks (e.g., PASCAL VOC 2012, COCO, RefCOCO) for which standard splits are typically used. However, it does not explicitly state the specific training, validation, and test split percentages, sample counts, or direct citations for the splits within the main text. |
| Hardware Specification | Yes | This work used NVIDIA GPUs at NCSA Delta through allocation CIS240059 and CIS250059 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program... measured on one A100 GPU. |
| Software Dependencies | No | The paper mentions using SAM2 (Ravi et al., 2024) with a Hiera-Large backbone and specific configurations, but it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA, which are crucial for reproducibility. |
| Experiment Setup | Yes | For all experiments, we filter global patches using a threshold of τ = 0.07. The crop size is uniformly set to 336 for all CLIP models (ViT-B/16 through ViT-H/14), while SigLIP2 and Perception Encoder use their respective default input resolutions. Region masks are generated with SAM2 (Ravi et al., 2024) Hiera-Large, using the following configuration: pred-iou-thresh set to 0.6, stability-score-thresh to 0.6, box-nms-thresh to 0.9, and points-per-side to 16. In the semantic segmentation experiments on the Cityscapes dataset, we increase points-per-side to 36 due to its high resolution and the abundance of small objects. To mitigate the impact of duplicated or overlapping masks, we also merge masks with an overlap IoU greater than 0.8. |
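The mask post-processing described in the setup row (merging masks whose overlap IoU exceeds 0.8) can be sketched as below. This is a minimal illustration, not the authors' released code: it assumes masks arrive as boolean NumPy arrays of the same shape, and uses a simple greedy single-pass merge, which is one plausible reading of "merge masks with an overlap IoU greater than 0.8".

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of identical shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0


def merge_overlapping_masks(masks, iou_thresh: float = 0.8):
    """Greedily union masks whose IoU with an already-kept mask
    exceeds iou_thresh; otherwise keep them as separate regions.

    `iou_thresh=0.8` matches the threshold quoted in the setup row.
    """
    merged: list[np.ndarray] = []
    for m in masks:
        m = m.astype(bool)
        for i, kept in enumerate(merged):
            if mask_iou(m, kept) > iou_thresh:
                merged[i] = np.logical_or(kept, m)  # fuse duplicates
                break
        else:
            merged.append(m)  # no near-duplicate found
    return merged
```

For example, two SAM2 masks that differ by a single pixel (IoU ≈ 0.91) would be fused into one region, while a disjoint mask stays separate.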