MaskPrompt: Open-Vocabulary Affordance Segmentation with Object Shape Mask Prompts

Authors: Dongpan Chen, Dehui Kong, Jinghua Li, Baocai Yin

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Quantitative and qualitative evaluations compared with state-of-the-art methods demonstrate that the proposed method achieves superior performance on the proposed benchmark dataset and other open-vocabulary part segmentation datasets. We conduct extensive experiments on the benchmark and other object part segmentation datasets, which demonstrates the effectiveness of our proposed method.
Researcher Affiliation | Academia | Dongpan Chen, Dehui Kong, Jinghua Li, Baocai Yin, School of Information Science and Technology, Beijing University of Technology, Beijing, China, EMAIL, EMAIL
Pseudocode | No | The paper describes the architecture and methodology in detail, outlining the steps and components, but does not include any formal pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We combine existing affordance segmentation dataset IIT-AFF (Nguyen et al. 2017) and part segmentation dataset Pascal-Part-108 (Michieli et al. 2020), and re-annotate labels according to the affordances of target entities (including objects, humans and animals) to construct an open-vocabulary affordance segmentation dataset, namely OVAS-25. ... We also evaluate the proposed model on another affordance segmentation dataset UMD (Myers et al. 2015) and other part segmentation datasets, i.e., Pascal-Part-58 (Chen et al. 2014), Pascal-Part-116 (Wei et al. 2024), Pascal-Part-201 (Singh et al. 2022), and ADE20K-Part-234 (Wei et al. 2024).
Dataset Splits | Yes | OVAS-25 has 28 entity classes and 25 affordance classes (as shown in Fig. 1), totalling 18938 images, of which 11363 are used for training and 7575 for testing.
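The reported split sizes are internally consistent, as a quick check confirms (the numbers are from the report; the snippet itself is ours, not from the paper):

```python
# Consistency check of the reported OVAS-25 split sizes.
train_images, test_images = 11_363, 7_575
total_images = train_images + test_images

assert total_images == 18_938  # matches the reported dataset size
print(f"train fraction: {train_images / total_images:.1%}")  # prints "train fraction: 60.0%"
```

In other words, roughly a 60/40 train/test split.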
Hardware Specification | Yes | All experiments are conducted on an NVIDIA A800 80GB GPU.
Software Dependencies | No | The paper mentions using specific tools like DETR, SAM, Alpha-CLIP, CLIP's text encoder, and MaskFormer, but does not provide specific version numbers for any of these software components or other libraries.
Experiment Setup | Yes | We train the whole model for 120K iterations with a learning rate of 10^-4 decreased by 10 times at 60K and 100K iterations. We optimize the network by AdamW with the weight decay 10^-4 and batch size 32. The layers of pixel decoder L is 6. In each layer, the embedding dimension is 768, the head number of the multi-head attention is 12, d is 512, and the hidden dimension of the feed-forward network is 3072. For the dimensions of text and vision features, dt, dv, dvt, and dcls are all 512.
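The reported learning-rate schedule (base lr 10^-4, divided by 10 at 60K and 100K of 120K total iterations) is a standard step decay. A minimal sketch, assuming the milestones and factor above; the helper name `lr_at_step` is ours, not from the paper:

```python
# Step-decay schedule as described in the reported experiment setup.
BASE_LR = 1e-4                  # initial learning rate
MILESTONES = (60_000, 100_000)  # iterations where the lr is divided by 10
TOTAL_ITERS = 120_000           # total training iterations

def lr_at_step(step: int) -> float:
    """Learning rate after `step` training iterations (illustrative helper)."""
    decays = sum(step >= m for m in MILESTONES)  # milestones already passed
    return BASE_LR * 0.1 ** decays

# lr_at_step(0) is 1e-4; after 60K iterations the lr drops to ~1e-5,
# and after 100K to ~1e-6.
```

In a PyTorch training loop this would correspond to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60_000, 100_000], gamma=0.1)` with one scheduler step per iteration.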