Grounding Multimodal Large Language Model in GUI World

Authors: Weixian Lei, Difei Gao, Mike Zheng Shou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our approach demonstrates superior performance in task accuracy and adaptability, as validated by benchmarks such as ScreenSpot, MiniWob, AITW, and Mind2Web. Our code and data are released at https://github.com/showlab/AssistGUIGround."
Researcher Affiliation | Academia | "Weixian Lei, Difei Gao, Mike Zheng Shou; Show Lab, National University of Singapore"
Pseudocode | No | The paper describes the model architecture and training procedures in detail using prose and diagrams (Figure 2, Figure 3), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | "Our code and data are released at https://github.com/showlab/AssistGUIGround."
Open Datasets | Yes | "Our code and data are released at https://github.com/showlab/AssistGUIGround. In total, we collected 0.5M screenshots with 35M UI element annotations. Our collected data covers a broad range of topics and features rich elements with dense annotations. We provide a comparison of our collected dataset with previous GUI data in Table 1. ... MiniWob (Shi et al., 2017), AITW (Rawles et al., 2023), and Mind2Web (Deng et al., 2024). ... Widget Caption (Li et al., 2020) and RICO (Deka et al., 2017)."
Dataset Splits | Yes | "Following the approach detailed in (Cheng et al., 2024), we apply the same train/test split based on instructions, retaining a single trajectory per instruction and ensuring no overlap between the training and test sets. Following (Cheng et al., 2024), we conduct 2.8K episode rollouts for training."
Hardware Specification | Yes | "Training runs on 32 V100 GPUs with a global batch size of 128 for 150K steps. ... For vision-based agent tasks, we fine-tune with 2 A100 GPUs, applying LoRA (Hu et al., 2021) tuning (rank 8, alpha 16) for the language model."
Software Dependencies | No | The paper mentions using the AdamW optimizer and LoRA tuning, but these are methods rather than software packages. It does not provide version numbers for specific libraries or languages such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "Training runs on 32 V100 GPUs with a global batch size of 128 for 150K steps. The AdamW (Loshchilov, 2017) optimizer is used with β1 = 0.9, β2 = 0.98, and a weight decay of 1e-4. A Cosine Annealing scheduler manages the learning rate, starting with a warm-up over 200 steps. The max learning rate is 1e-3, dropping to 5e-5. Hyperparameters for the training objective are set as λmse = 10, λl1 = 5, and λGIoU = 2. For vision-based agent tasks, we fine-tune with 2 A100 GPUs, applying LoRA (Hu et al., 2021) tuning (rank 8, alpha 16) for the language model. ... The training objective hyperparameters are λtext = 1 and λgrd = 1."
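The quoted learning-rate schedule (200-step linear warm-up, then cosine annealing from a max of 1e-3 down to 5e-5 over 150K steps) can be sketched as a standalone function. This is a hedged reconstruction from the numbers reported above, not the authors' released training code; the function name and the exact warm-up shape are assumptions.

```python
import math

WARMUP_STEPS = 200       # warm-up length quoted in the paper
TOTAL_STEPS = 150_000    # total training steps quoted in the paper
MAX_LR = 1e-3            # peak learning rate
MIN_LR = 5e-5            # final learning rate after cosine decay

def learning_rate(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine annealing to MIN_LR."""
    if step < WARMUP_STEPS:
        # Ramp linearly from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the post-warm-up budget consumed so far, in [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would typically be wrapped in a `torch.optim.lr_scheduler.LambdaLR` around an `AdamW` optimizer configured with the paper's betas (0.9, 0.98) and weight decay 1e-4.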
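The LoRA configuration quoted above (rank 8, alpha 16) can be illustrated with the standard low-rank update W' = W + (alpha / rank) * B A, where B is initialized to zero so the adapted weights start identical to the frozen base weights. This is a minimal pure-Python sketch of the LoRA formula, not the paper's actual fine-tuning code; the matrix sizes and helper names are illustrative assumptions.

```python
import random

RANK, ALPHA = 8, 16      # LoRA hyper-parameters reported in the paper
SCALE = ALPHA / RANK     # standard LoRA scaling of the low-rank update

def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def lora_effective_weight(W, A, B):
    """Return W + SCALE * (B @ A): frozen weight plus the low-rank adapter."""
    delta = matmul(B, A)
    return [[w + SCALE * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d_in = d_out = 16  # toy dimensions; real LLM projections are much larger
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(RANK)]
B = [[0.0] * RANK for _ in range(d_out)]  # B starts at zero, so W' == W at init
```

Because B is all zeros at initialization, `lora_effective_weight(W, A, B)` reproduces W exactly, which is why LoRA fine-tuning starts from the base model's behavior; in practice only A and B (a small fraction of the parameters) receive gradient updates.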