Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Authors: Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we compare our method to the SOTA ones on three benchmarks, each designed to test our method from different perspectives. Besides, we conduct several ablation studies to further analyze the effectiveness of our method. [...] Results As shown in Tab. 1, the foundation MLLM Qwen-VL-Chat, while capable of detecting general objects, struggles to localize query text in the OCG task."
Researcher Affiliation | Academia | "1Australian Institute for Machine Learning, The University of Adelaide; 2University of Wollongong"
Pseudocode | No | The paper describes the method using textual descriptions and a pipeline diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/HeimingX/TAG.git
Open Datasets | Yes | "Mind2Web (Deng et al. 2024) dataset. [...] ScreenSpot dataset which is a realistic grounding evaluation dataset proposed by (Cheng et al. 2024)"
Dataset Splits | No | The paper uses existing benchmarks such as ScreenSpot and Mind2Web and follows their evaluation setups, but it does not explicitly state the train/validation/test splits (percentages or counts) needed to reproduce the data partitioning.
Hardware Specification | Yes | "all experiments can be conducted on one NVIDIA RTX 4090 GPU."
Software Dependencies | No | The paper names several models and APIs (e.g., MiniCPM-Llama3-V 2.5, the Azure Vision API) but does not provide versioned software dependencies (Python version, library versions) needed to replicate the experiments.
Experiment Setup | Yes | "Based on these results, we use K = 10 in all experiments. [...] Thus δ = 0.5 is used across all datasets."
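The two quoted hyperparameters (K = 10 and δ = 0.5) suggest a top-K selection step and a thresholding step applied to attention maps. The sketch below is purely illustrative and is not the paper's actual TAG algorithm: the function name, the head-scoring rule, and the map-fusion scheme are all assumptions made for this example, showing only the generic shape of "select K attention maps, fuse them, threshold at δ, read off a box."

```python
import numpy as np

def ground_from_attention(attn_maps, k=10, delta=0.5):
    """Hypothetical sketch (NOT the paper's method): localize a GUI
    target by fusing the top-K attention maps and thresholding at delta.

    attn_maps: array of shape (num_maps, H, W) with attention scores.
    Returns (x_min, y_min, x_max, y_max) in pixel/patch coordinates.
    """
    # Assumed scoring rule: rank each map by its peak activation.
    scores = attn_maps.reshape(attn_maps.shape[0], -1).max(axis=1)
    top_k = attn_maps[np.argsort(scores)[-k:]]
    # Assumed fusion rule: average the selected maps, normalize to [0, 1].
    fused = top_k.mean(axis=0)
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
    # Threshold at delta and return the bounding box of the active region.
    ys, xs = np.where(fused >= delta)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

With K = 10 and δ = 0.5 as reported, the only inputs would be the model's attention maps; the paper's real selection and fusion criteria may differ from the peak-activation and averaging choices assumed here.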