Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension
Authors: Yaxian Wang, Henghui Ding, Shuting He, Xudong Jiang, Bifan Wei, Jun Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and also on four other tasks, including REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the remarkable superiority and generalizability of the proposed HieA2G. |
| Researcher Affiliation | Academia | 1School of Computer Science and Technology, Xi'an Jiaotong University, China 2Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi'an Jiaotong University, China 3Institute of Big Data, Fudan University, China 4Shanghai University of Finance and Economics, China 5Nanyang Technological University, Singapore 6School of Continuing Education, Xi'an Jiaotong University, China 7Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., Lw2o = α(1 − cos(Tw, T̂w))), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | The proposed HieA2G is evaluated mainly on the gRefCOCO dataset (He et al. 2023; Liu, Ding, and Jiang 2023) for GREC and GRES. We also conducted experiments on a phrase grounding dataset called Flickr30K Entities (Plummer et al. 2015), and three widely-used REC and RES benchmarks including RefCOCO (Yu et al. 2016), RefCOCO+ (Yu et al. 2016), and RefCOCOg (Mao et al. 2016). |
| Dataset Splits | Yes | Results on GREC. As shown in Table 1, our HieA2G with the ResNet101 backbone achieves superior performance on both metrics across three splits of the gRefCOCO dataset. It shows an average performance gain of 14.2% in Pr@(F1=1, IoU≥0.5) over Ferret (You et al. 2024) using a Multimodal Large Language Model (MLLM), and an average performance gain of 9.4% in N-acc. over UNINEXT (Yan et al. 2023). These results indicate that HieA2G has a significant advantage in handling various types of text expressions to flexibly detect target objects ranging from zero to multiple. Results on REC. As illustrated in Table 2, HieA2G achieves consistent performance gains across all splits of the three datasets compared to existing classic REC methods. |
| Hardware Specification | No | The paper mentions "GPU memory" in the context of batch-size limitations, but does not provide specific details such as GPU model, CPU type, or memory capacity used for experiments: "However, the size of the batch size is limited by GPU memory." |
| Software Dependencies | No | The paper states: "We adopt ResNet101 (He et al. 2016) and Swin-B (Liu et al. 2021) as our visual encoder, and RoBERTa-base (Liu et al. 2019b) as our text encoder." These are model architectures and not specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | No | The paper defines loss functions and mentions hyperparameters (λ, α, τ) but does not provide their specific values (e.g., learning rate, batch size, number of epochs). "Ldet = λbbox·Lbbox + λgiou·Lgiou + λclass·Lclass, (13) Lseg = λmask·Lmask + λdice·Ldice, (14) where λ are the hyperparameters..." and "where α is set to 0 when given a no-target sample, otherwise set to 1." and "where τ is a temperature hyperparameter." |
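The weighted loss combination quoted above (Eqs. 13–14) is a standard λ-weighted sum. A minimal sketch, assuming placeholder weight values (the paper does not report its λ settings, so the numbers below are illustrative only):

```python
# Sketch of the lambda-weighted loss sums in Eqs. 13-14 of the paper.
# The weight values are HYPOTHETICAL placeholders, not the paper's settings.

def combine_losses(losses: dict, weights: dict) -> float:
    """Weighted sum: L = sum_k (lambda_k * L_k)."""
    return sum(weights[k] * losses[k] for k in losses)

# Illustrative per-term loss values and placeholder lambdas.
det_losses = {"bbox": 0.8, "giou": 0.5, "class": 1.2}
det_weights = {"bbox": 5.0, "giou": 2.0, "class": 1.0}  # placeholders
l_det = combine_losses(det_losses, det_weights)  # 5*0.8 + 2*0.5 + 1*1.2 = 6.2
```

In practice, reproducing the paper would require the unreported λ values; this sketch only shows the functional form.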
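The word-to-object alignment loss quoted in the Pseudocode row, Lw2o = α(1 − cos(Tw, T̂w)) with α = 0 for no-target samples and α = 1 otherwise, can be sketched as follows (function and argument names are my own, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l_w2o(t_w, t_hat_w, has_target: bool) -> float:
    """L_w2o = alpha * (1 - cos(T_w, T_hat_w)).

    Per the paper's description, alpha is 0 for a no-target sample
    and 1 otherwise, so the loss vanishes when no object is referred to.
    """
    alpha = 1.0 if has_target else 0.0
    return alpha * (1.0 - cosine(t_w, t_hat_w))

# Identical embeddings give zero loss; a no-target sample always gives zero.
print(l_w2o([1.0, 0.0], [1.0, 0.0], True))    # 0.0
print(l_w2o([1.0, 0.0], [0.0, 1.0], False))   # 0.0
```

The loss is minimized when the word embedding Tw and its reconstruction T̂w point in the same direction, which matches the alignment objective the equation expresses.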