Grounding Multimodal Large Language Model in GUI World

Authors: Weixian Lei, Difei Gao, Mike Zheng Shou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our approach demonstrates superior performance in task accuracy and adaptability, as validated by benchmarks such as ScreenSpot, MiniWob, AITW, and Mind2Web. Our code and data are released at https://github.com/showlab/AssistGUIGround."
Researcher Affiliation | Academia | "Weixian Lei, Difei Gao, Mike Zheng Shou; Show Lab, National University of Singapore"
Pseudocode | No | The paper describes the model architecture and training procedures in detail using prose and diagrams (Figure 2, Figure 3), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | "Our code and data are released at https://github.com/showlab/AssistGUIGround."
Open Datasets | Yes | "Our code and data are released at https://github.com/showlab/AssistGUIGround. In total, we collected 0.5M screenshots with 35M UI element annotations. Our collected data covers a broad range of topics and features rich elements with dense annotations. We provide a comparison of our collected dataset with previous GUI data in Table 1. ... MiniWob (Shi et al., 2017), AITW (Rawles et al., 2023), and Mind2Web (Deng et al., 2024). ... Widget Caption (Li et al., 2020) and RICO (Deka et al., 2017)."
Dataset Splits | Yes | "Following the approach detailed in (Cheng et al., 2024), we apply the same train/test split based on instructions, retaining a single trajectory per instruction and ensuring no overlap between the training and test sets. Following (Cheng et al., 2024), we conduct 2.8K episode rollouts for training."
Hardware Specification | Yes | "Training runs on 32 V100 GPUs with a global batch size of 128 for 150K steps. ... For vision-based agent tasks, we fine-tune with 2 A100 GPUs, applying LoRA (Hu et al., 2021) tuning (rank 8, alpha 16) for the language model."
Software Dependencies | No | The paper mentions using the AdamW optimizer and LoRA tuning, but these are methods rather than software packages. It does not provide version numbers for specific libraries or languages such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "Training runs on 32 V100 GPUs with a global batch size of 128 for 150K steps. The AdamW (Loshchilov, 2017) optimizer is used with β1 = 0.9, β2 = 0.98, and a weight decay of 1e-4. A Cosine Annealing scheduler manages the learning rate, starting with a warm-up over 200 steps. The max learning rate is 1e-3, dropping to 5e-5. Hyperparameters for the training objective are set as λmse = 10, λl1 = 5, and λGIoU = 2. For vision-based agent tasks, we fine-tune with 2 A100 GPUs, applying LoRA (Hu et al., 2021) tuning (rank 8, alpha 16) for the language model. ... The training objective hyperparameters are λtext = 1 and λgrd = 1."
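The quoted learning-rate schedule (200-step linear warm-up, then cosine annealing from a max of 1e-3 down to 5e-5 over 150K steps) can be sketched as a standalone function. This is a hedged reconstruction from the numbers reported above, not the authors' released training code; the function name and the exact warm-up shape are assumptions.

```python
import math

WARMUP_STEPS = 200       # warm-up length quoted in the paper
TOTAL_STEPS = 150_000    # total training steps quoted in the paper
MAX_LR = 1e-3            # peak learning rate
MIN_LR = 5e-5            # final learning rate after cosine decay

def learning_rate(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine annealing to MIN_LR."""
    if step < WARMUP_STEPS:
        # Ramp linearly from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the post-warm-up budget consumed so far, in [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would typically be wrapped in a `torch.optim.lr_scheduler.LambdaLR` around an `AdamW` optimizer configured with the paper's betas (0.9, 0.98) and weight decay 1e-4.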
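The LoRA configuration quoted above (rank 8, alpha 16) can be illustrated with the standard low-rank update W' = W + (alpha / rank) * B A, where B is initialized to zero so the adapted weights start identical to the frozen base weights. This is a minimal pure-Python sketch of the LoRA formula, not the paper's actual fine-tuning code; the matrix sizes and helper names are illustrative assumptions.

```python
import random

RANK, ALPHA = 8, 16      # LoRA hyper-parameters reported in the paper
SCALE = ALPHA / RANK     # standard LoRA scaling of the low-rank update

def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def lora_effective_weight(W, A, B):
    """Return W + SCALE * (B @ A): frozen weight plus the low-rank adapter."""
    delta = matmul(B, A)
    return [[w + SCALE * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d_in = d_out = 16  # toy dimensions; real LLM projections are much larger
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(RANK)]
B = [[0.0] * RANK for _ in range(d_out)]  # B starts at zero, so W' == W at init
```

Because B is all zeros at initialization, `lora_effective_weight(W, A, B)` reproduces W exactly, which is why LoRA fine-tuning starts from the base model's behavior; in practice only A and B (a small fraction of the parameters) receive gradient updates.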