Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Authors: Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we compare our method to the SOTA ones on three benchmarks, each designed to test our method from different perspectives. Besides, we conduct several ablation studies to further analyze the effectiveness of our method. [...] Results As shown in Tab. 1, the foundation MLLM Qwen-VL-Chat, while capable of detecting general objects, struggles to localize query text in the OCG task."
Researcher Affiliation | Academia | "1Australian Institute for Machine Learning, The University of Adelaide; 2University of Wollongong"
Pseudocode | No | The paper describes the method using textual descriptions and a pipeline diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/HeimingX/TAG.git
Open Datasets | Yes | "Mind2Web (Deng et al. 2024) dataset. [...] ScreenSpot dataset which is a realistic grounding evaluation dataset proposed by (Cheng et al. 2024)"
Dataset Splits | No | The paper uses existing benchmarks such as ScreenSpot and Mind2Web and follows their evaluation setups, but it does not explicitly state the train/validation/test splits (percentages or counts) needed to reproduce the data partitioning.
Hardware Specification | Yes | "all experiments can be conducted on one NVIDIA RTX 4090 GPU."
Software Dependencies | No | The paper names several models and APIs (e.g., MiniCPM-Llama3-V 2.5, the Azure Vision API) but does not provide versioned software dependencies (Python version, library versions) needed to replicate the experiments.
Experiment Setup | Yes | "Based on these results, we use K = 10 in all experiments. [...] Thus δ = 0.5 is used across all datasets."
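The two quoted hyperparameters (K = 10 and δ = 0.5) suggest a top-K selection step and a thresholding step applied to attention maps. The sketch below is purely illustrative and is not the paper's actual TAG algorithm: the function name, the head-scoring rule, and the map-fusion scheme are all assumptions made for this example, showing only the generic shape of "select K attention maps, fuse them, threshold at δ, read off a box."

```python
import numpy as np

def ground_from_attention(attn_maps, k=10, delta=0.5):
    """Hypothetical sketch (NOT the paper's method): localize a GUI
    target by fusing the top-K attention maps and thresholding at delta.

    attn_maps: array of shape (num_maps, H, W) with attention scores.
    Returns (x_min, y_min, x_max, y_max) in pixel/patch coordinates.
    """
    # Assumed scoring rule: rank each map by its peak activation.
    scores = attn_maps.reshape(attn_maps.shape[0], -1).max(axis=1)
    top_k = attn_maps[np.argsort(scores)[-k:]]
    # Assumed fusion rule: average the selected maps, normalize to [0, 1].
    fused = top_k.mean(axis=0)
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
    # Threshold at delta and return the bounding box of the active region.
    ys, xs = np.where(fused >= delta)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

With K = 10 and δ = 0.5 as reported, the only inputs would be the model's attention maps; the paper's real selection and fusion criteria may differ from the peak-activation and averaging choices assumed here.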