Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
Authors: Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Moorthy, Jeffrey Nichols, Yinfei Yang, Zhe Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical experiments on referring, grounding, and user-centric advanced tasks (comprising 9 subtasks across 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI and also shows strong cross-platform transfer capabilities. |
| Researcher Affiliation | Collaboration | Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan (Apple) |
| Pseudocode | Yes | Algorithm 1: Adaptive N-gridding |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is open-sourced or provide a link to a code repository. |
| Open Datasets | Yes | The web data is derived from the WebUI dataset (Wu et al., 2023). The Android data for screenshots, bounding boxes, and text annotations is transformed from the RICO dataset (Deka et al., 2017). Besides, we also employ third-party training datasets to enrich our data source and avoid overfitting our predefined tasks. Complete statistics of the training dataset of Ferret-UI 2 are summarized in Table 1, indicating that the dataset distribution is very unbalanced across different platforms. In particular, the number of iPad and Apple TV screenshots is significantly smaller than that of other platforms. [...] we augment training data with additional third-party datasets, including GroundUI-18k (Zheng et al., 2024b), GUIDE (Chawla et al., 2024) and Spotlight (Li & Li, 2023). |
| Dataset Splits | No | The paper mentions that for evaluation, they created evaluation data, but it does not specify explicit train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. It mentions models like CLIP ViT-L/14, Vicuna-13B, Gemma-2B, and Llama3-8B, but these are software models, not hardware specifications. |
| Software Dependencies | No | The paper mentions several models and frameworks such as GPT-4o, Ferret-UI, CLIP, Vicuna-13B, Gemma-2B, and Llama3-8B. However, it does not provide specific version numbers for general software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | For dynamic high-resolution image encoding, we set the size limit N to 8, so that the maximal grid number is 16 for adaptive gridding. [...] we (i) assign different loss weights to different platforms during training, and (ii) generate all three types of advanced tasks for each example of the iPad and Apple TV platforms, while generating only 1 type of advanced task for each example of the other platforms. |
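The adaptive N-gridding setup quoted above (size limit N = 8, at most 16 grid cells) can be sketched as follows. This is a minimal illustration, not the paper's exact Algorithm 1: the scoring heuristic, the tie-breaking bonus for larger grids, and all function names are assumptions made for this sketch.

```python
from itertools import product

def adaptive_gridding(width, height, max_cells=16, size_limit=8):
    """Pick a (rows, cols) grid whose aspect ratio best matches the image.

    Illustrative sketch of adaptive N-gridding: enumerate candidate grids
    up to the size limit, cap the total cell count, and prefer the grid
    whose cols/rows ratio is closest to the image's width/height ratio.
    """
    image_ratio = width / height
    best, best_score = (1, 1), float("inf")
    for rows, cols in product(range(1, size_limit + 1), repeat=2):
        if rows * cols > max_cells:
            continue  # respect the maximal grid number (16 when N = 8)
        grid_ratio = cols / rows  # cols span the width, rows the height
        # Penalize aspect-ratio mismatch; tiny bonus for more cells so
        # finer grids win ties (heuristic, not from the paper).
        score = abs(grid_ratio - image_ratio) - 0.001 * rows * cols
        if score < best_score:
            best, best_score = (rows, cols), score
    return best

def crop_boxes(width, height, rows, cols):
    """Pixel bounding boxes (x1, y1, x2, y2) for each grid cell."""
    return [(c * width // cols, r * height // rows,
             (c + 1) * width // cols, (r + 1) * height // rows)
            for r in range(rows) for c in range(cols)]
```

Under this sketch, each crop would be resized to the vision encoder's base resolution and encoded separately alongside a downsampled global view, which is the usual pattern for dynamic high-resolution encoding.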