Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning
Authors: Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare our method to the SOTA ones on three benchmarks, each designed to test our method from different perspectives. Besides, we conduct several ablation studies to further analyze the effectiveness of our method. [...] Results As shown in Tab. 1, the foundation MLLM Qwen-VL-Chat, while capable of detecting general objects, struggles to localize query text in the OCG task. |
| Researcher Affiliation | Academia | 1. Australian Institute for Machine Learning, The University of Adelaide; 2. University of Wollongong (author email addresses redacted) |
| Pseudocode | No | The paper describes the method using textual descriptions and a pipeline diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/HeimingX/TAG.git |
| Open Datasets | Yes | Mind2Web (Deng et al. 2024) dataset. [...] ScreenSpot dataset, which is a realistic grounding evaluation dataset proposed by (Cheng et al. 2024) |
| Dataset Splits | No | The paper uses existing benchmarks such as ScreenSpot and Mind2Web and refers to their evaluation setups, but it does not explicitly provide the training/validation/test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | All experiments can be conducted on one NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions various models and APIs (e.g., MiniCPM-Llama3-V 2.5, the Azure Vision API tool) but does not provide specific software dependencies with version numbers (e.g., Python version, library versions) needed to replicate the experiments. |
| Experiment Setup | Yes | Based on these results, we use K = 10 in all experiments. [...] Thus δ = 0.5 is used across all datasets. |