MaskPrompt: Open-Vocabulary Affordance Segmentation with Object Shape Mask Prompts
Authors: Dongpan Chen, Dehui Kong, Jinghua Li, Baocai Yin
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative evaluations compared with state-of-the-art methods demonstrate that the proposed method achieves superior performance on the proposed benchmark dataset and other open-vocabulary part segmentation datasets. We conduct extensive experiments on the benchmark and other object part segmentation datasets, which demonstrates the effectiveness of our proposed method. |
| Researcher Affiliation | Academia | Dongpan Chen, Dehui Kong, Jinghua Li, Baocai Yin — School of Information Science and Technology, Beijing University of Technology, Beijing, China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the architecture and methodology in detail, outlining the steps and components, but does not include any formal pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We combine the existing affordance segmentation dataset IIT-AFF (Nguyen et al. 2017) and part segmentation dataset Pascal-Part-108 (Michieli et al. 2020), and re-annotate labels according to the affordances of target entities (including objects, humans and animals) to construct an open-vocabulary affordance segmentation dataset, namely OVAS-25. ... We also evaluate the proposed model on another affordance segmentation dataset UMD (Myers et al. 2015) and other part segmentation datasets, i.e., Pascal-Part-58 (Chen et al. 2014), Pascal-Part-116 (Wei et al. 2024), Pascal-Part-201 (Singh et al. 2022), and ADE20K-Part-234 (Wei et al. 2024). |
| Dataset Splits | Yes | OVAS-25 has 28 entity classes and 25 affordance classes (as shown in Fig. 1), totalling 18938 images, of which 11363 are used for training and 7575 for testing. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA A800 80GB GPU. |
| Software Dependencies | No | The paper mentions using specific tools like DETR, SAM, Alpha-CLIP, CLIP's text encoder, and MaskFormer, but does not provide version numbers for any of these software components or other libraries. |
| Experiment Setup | Yes | We train the whole model for 120K iterations with a learning rate of 10⁻⁴, decreased by a factor of 10 at 60K and 100K iterations. We optimize the network with AdamW using a weight decay of 10⁻⁴ and a batch size of 32. The number of pixel decoder layers L is 6. In each layer, the embedding dimension is 768, the number of multi-head attention heads is 12, d is 512, and the hidden dimension of the feed-forward network is 3072. For the dimensions of text and vision features, d_t, d_v, d_vt, and d_cls are all 512. |