Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors
Authors: Peiran Xu, Yadong Mu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that the proposed model achieves a substantial improvement over existing methods. Our codes are available at https://github.com/woyut/WSAG-PLSP . We conduct experiments on AGD20K (Luo et al., 2022), a widely-used WSAG dataset. The evaluation metrics include Kullback-Leibler Divergence (KLD), Similarity (SIM), and Normalized Scanpath Saliency (NSS), consistent with previous works. We perform ablation studies to examine the effectiveness of each proposed module, and the results are shown in Table 2. |
| Researcher Affiliation | Academia | Peiran Xu, Yadong Mu, Wangxuan Institute of Computer Technology, Peking University, Beijing, China |
| Pseudocode | No | The paper describes methods and mathematical formulations in regular paragraph text and equations (e.g., L_align = 1 − Cos-Sim(f^A, stop-grad(f^E)) and L_all = L_KL + λ1(L_align + L_exo-cls) + λ2·L_reason), but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Our codes are available at https://github.com/woyut/WSAG-PLSP . |
| Open Datasets | Yes | We conduct experiments on AGD20K (Luo et al., 2022), a widely-used WSAG dataset. |
| Dataset Splits | Yes | Following prior work, two types of data split are considered in the experiments. For the seen split, all object categories exist in the training set. For the unseen split, there is no intersection between the object categories in the training set and the test set, meaning that the model needs to generalize its knowledge of affordance to novel objects during testing. AffordanceLLM (Qian et al., 2024) has defined a new split for AGD20K, namely the hard split, which is similar to the unseen split but requires a higher degree of generalization. Following the fully supervised setting (Appendix C), it has 868/807 images with dense annotations for training/testing. |
| Hardware Specification | Yes | In Table 3, we also compare the inference time across different models. It is evaluated on an NVIDIA GeForce RTX 2080Ti, and the batch size is set to 1. The whole training process can be performed on a single NVIDIA GeForce RTX 2080Ti. We deploy the model trained on AGD20K's seen split to an Aubo-i5 robotic arm with a Robotiq 2F-85 two-finger gripper. The observation of the scene is captured by an on-hand Intel RealSense D435i RGB-D camera. |
| Software Dependencies | No | The code is implemented in PyTorch. We use the AdamW optimizer. We adopt CLIP (Radford et al., 2021) (ViT-B/16) as the visual and textual encoder... The mask decoder is adapted from the decoder of SAM (Kirillov et al., 2023). However, specific version numbers for PyTorch or other libraries are not provided. |
| Experiment Setup | Yes | We train the model for 40 epochs using the AdamW optimizer, with the learning rate set to 1e-4, betas set to (0.9, 0.95), and the weight decay coefficient set to 0.01. The batch size is set to 20, and each of the 20 egocentric images is accompanied by an exocentric image. The learning rate of the CLIP visual encoder is reduced to 1e-5 to prevent losing important semantic information acquired during CLIP's pre-training stage, while the CLIP text encoder is frozen. The loss coefficients λ1 and λ2 are set to 10 and 1, respectively. We use random cropping, random flip, and the stitching technique mentioned in Section 3.6 as augmentation. |
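The three evaluation metrics named above (KLD, SIM, NSS) are standard saliency/affordance-heatmap measures. As a minimal sketch of how they are typically computed over a predicted heatmap and a ground-truth map (function names and the exact normalization conventions are illustrative assumptions, not taken from the paper's codebase):

```python
import numpy as np

def _normalize_dist(x, eps=1e-12):
    """Normalize a non-negative heatmap into a probability distribution."""
    x = x.astype(np.float64)
    return x / (x.sum() + eps)

def kld(pred, gt, eps=1e-12):
    """Kullback-Leibler divergence D(gt || pred); lower is better."""
    p = _normalize_dist(gt)
    q = _normalize_dist(pred)
    return float(np.sum(p * np.log(p / (q + eps) + eps)))

def sim(pred, gt):
    """Similarity (histogram intersection) of two normalized maps; higher is better."""
    return float(np.minimum(_normalize_dist(pred), _normalize_dist(gt)).sum())

def nss(pred, fixation_mask):
    """Normalized Scanpath Saliency: mean of the standardized prediction
    at ground-truth (fixated) pixels; higher is better."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    return float(p[fixation_mask.astype(bool)].mean())
```

For identical prediction and ground truth, KLD is approximately 0 and SIM is approximately 1, which is a quick sanity check when reproducing the reported numbers.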
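The loss formulation quoted in the Pseudocode row, combined with the coefficients λ1 = 10 and λ2 = 1 from the experiment setup, can be sketched as follows. This is a gradient-free numpy illustration, not the authors' implementation: the stop-grad on the exocentric feature is a no-op copy here (in PyTorch it would be `.detach()`), and the function names are hypothetical.

```python
import numpy as np

def cosine_alignment_loss(f_ego, f_exo):
    """L_align = 1 - Cos-Sim(f^A, stop-grad(f^E)).
    The copy stands in for stop-grad; this numpy sketch carries no gradients."""
    f_exo = f_exo.copy()  # stand-in for stop-grad / .detach()
    num = float(np.dot(f_ego, f_exo))
    den = float(np.linalg.norm(f_ego) * np.linalg.norm(f_exo) + 1e-12)
    return 1.0 - num / den

def total_loss(l_kl, l_align, l_exo_cls, l_reason, lam1=10.0, lam2=1.0):
    """L_all = L_KL + lam1 * (L_align + L_exo-cls) + lam2 * L_reason,
    with lam1=10 and lam2=1 as reported in the experiment setup."""
    return l_kl + lam1 * (l_align + l_exo_cls) + lam2 * l_reason
```

Note that the alignment term vanishes when the egocentric feature already points in the same direction as the (detached) exocentric feature, so only the egocentric branch is pulled toward the exocentric one.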