Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
Authors: Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Moorthy, Jeffrey Nichols, Yinfei Yang, Zhe Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical experiments on referring, grounding, and user-centric advanced tasks (comprising 9 subtasks across 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI and also shows strong cross-platform transfer capabilities. |
| Researcher Affiliation | Collaboration | Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan (Apple) |
| Pseudocode | Yes | Algorithm 1: Adaptive N-gridding |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is open-sourced or provide a link to a code repository. |
| Open Datasets | Yes | The web data is derived from the WebUI dataset (Wu et al., 2023). The Android data for screenshots, bounding boxes, and text annotations is transformed from the RICO dataset (Deka et al., 2017). Besides, we also employ third-party training datasets to enrich our data source and avoid overfitting our predefined tasks. Complete statistics of the training dataset of Ferret-UI 2 are summarized in Table 1, indicating that the dataset distribution is very unbalanced across different platforms. In particular, the number of iPad and Apple TV screenshots is significantly smaller than that of other platforms. [...] we augment training data with additional third-party datasets, including GroundUI-18k (Zheng et al., 2024b), GUIDE (Chawla et al., 2024) and Spotlight (Li & Li, 2023). |
| Dataset Splits | No | The paper mentions that for evaluation, they created evaluation data, but it does not specify explicit train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. It mentions models like CLIP ViT-L/14, Vicuna-13B, Gemma-2B, and Llama3-8B, but these are software models, not hardware specifications. |
| Software Dependencies | No | The paper mentions several models and frameworks such as GPT-4o, Ferret-UI, CLIP, Vicuna-13B, Gemma-2B, and Llama3-8B. However, it does not provide specific version numbers for general software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | For dynamic high-resolution image encoding, we set the size limit N to 8, so that the maximal grid number is 16 for adaptive gridding. [...] we (i) assign different loss weights to different platforms during training, and (ii) generate all three types of advanced tasks for each example of the iPad and Apple TV platforms, while generating only 1 type of advanced task for each example of the other platforms. |
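The adaptive N-gridding setup quoted above (size limit N = 8, at most 16 grid cells) can be sketched as follows. This is a minimal illustration, not the paper's exact Algorithm 1: the scoring heuristic, the tie-breaking bonus for larger grids, and all function names are assumptions made for this sketch.

```python
from itertools import product

def adaptive_gridding(width, height, max_cells=16, size_limit=8):
    """Pick a (rows, cols) grid whose aspect ratio best matches the image.

    Illustrative sketch of adaptive N-gridding: enumerate candidate grids
    up to the size limit, cap the total cell count, and prefer the grid
    whose cols/rows ratio is closest to the image's width/height ratio.
    """
    image_ratio = width / height
    best, best_score = (1, 1), float("inf")
    for rows, cols in product(range(1, size_limit + 1), repeat=2):
        if rows * cols > max_cells:
            continue  # respect the maximal grid number (16 when N = 8)
        grid_ratio = cols / rows  # cols span the width, rows the height
        # Penalize aspect-ratio mismatch; tiny bonus for more cells so
        # finer grids win ties (heuristic, not from the paper).
        score = abs(grid_ratio - image_ratio) - 0.001 * rows * cols
        if score < best_score:
            best, best_score = (rows, cols), score
    return best

def crop_boxes(width, height, rows, cols):
    """Pixel bounding boxes (x1, y1, x2, y2) for each grid cell."""
    return [(c * width // cols, r * height // rows,
             (c + 1) * width // cols, (r + 1) * height // rows)
            for r in range(rows) for c in range(cols)]
```

Under this sketch, each crop would be resized to the vision encoder's base resolution and encoded separately alongside a downsampled global view, which is the usual pattern for dynamic high-resolution encoding.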