Lightweight Neural App Control

Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o.
Researcher Affiliation | Collaboration | Huawei Noah's Ark Lab; Tianjin University; AI Centre, University College London
Pseudocode | No | The paper describes the methodology and architecture in natural language and with diagrams (Figure 1 and Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for its methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Our experiments focus on two open-source mobile phone control datasets, Android Control (Li et al., 2024a) and Android-in-the-Wild (AitW) (Rawles et al., 2023).
Dataset Splits | No | We evaluate on the test set of two datasets, using the same process for all models, with only the observation format and model calling differing. LiMAC is trained on just 13K and 18K episodes for Android Control and AitW, respectively.
Hardware Specification | No | The paper mentions "computational constraints inherent to smartphones" and refers to "modern devices" in the context of model deployment, but it does not specify any hardware details (e.g., GPU/CPU models, memory) used for conducting its experiments or training the models.
Software Dependencies | No | The paper mentions that "AcT is a compact transformer based on GPT-2 architecture", uses "the AdamW optimiser (Loshchilov et al., 2017)", and fine-tunes Florence2 and Qwen2-VL with LoRA adapters (Hu et al., 2021). However, it does not provide specific version numbers for any programming languages, libraries, frameworks, or solvers used in the implementation.
Experiment Setup | Yes | AcT is a compact transformer based on the GPT-2 architecture. The transformer consists of 24 layers and 16 heads per layer. The hidden dimension of the transformer is 1024. We apply a dropout rate of 0.3 (Srivastava et al., 2014) during training across all layers. The AdamW optimiser (Loshchilov et al., 2017) is used in all experiments, with a learning rate of 3×10⁻⁴ specifically for AcT. The functions f_type and f_target are implemented as two-layer fully connected networks, each with a hidden size of 4096 and a dropout rate of 0.3. We use a batch size of 1 with gradient accumulation set to 32. We fine-tune Florence2 for 10 epochs, starting with an initial learning rate of 10⁻⁶, which is gradually reduced to zero during training. The batch size is set to 2, with gradient accumulation configured to 8. For Qwen2-VL, we employ LoRA with a dimensionality of 64, beginning with an initial learning rate of 10⁻⁴, also gradually decreasing to zero throughout training. The batch size for Qwen2-VL is 1, with gradient accumulation similarly set to 8. We fine-tuned Qwen2-VL for 3 epochs.
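The hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. All names here are hypothetical (the paper does not release code), and the linear schedule is one plausible reading of a learning rate "gradually reduced to zero":

```python
# Hypothetical summary of the AcT training setup described in the paper.
ACT_CONFIG = {
    "architecture": "GPT-2-style transformer",
    "layers": 24,
    "heads_per_layer": 16,
    "hidden_dim": 1024,
    "dropout": 0.3,
    "optimiser": "AdamW",
    "learning_rate": 3e-4,
    "batch_size": 1,
    "grad_accumulation": 32,
    "head_hidden_dim": 4096,  # f_type and f_target two-layer MLP heads
}

def effective_batch_size(cfg):
    """Gradients are accumulated, so the effective batch is batch * accumulation steps."""
    return cfg["batch_size"] * cfg["grad_accumulation"]

def linear_decay_lr(initial_lr, step, total_steps):
    """One possible 'gradually reduced to zero' schedule (Florence2: 1e-6, Qwen2-VL: 1e-4)."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)
```

With a batch size of 1 and 32 accumulation steps, each optimiser update for AcT sees an effective batch of 32 episodes' worth of gradients.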
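The LoRA adapters cited in the table (Hu et al., 2021; rank 64 for Qwen2-VL) replace full fine-tuning of a weight matrix with a trainable low-rank update added to the frozen weights. A minimal NumPy sketch of the arithmetic; this is illustrative only and not tied to the paper's (unreleased) code or any particular adapter library:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA forward pass: y = x @ (W + (alpha/r) * B @ A).T,
    where W is frozen and only A and B are trained.
    Shapes: W (d_out, d_in), A (r, d_in), B (d_out, r); r is the adapter rank
    (the paper uses r = 64 for Qwen2-VL)."""
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)  # rank-r update to the frozen weights
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))   # Gaussian init, as in Hu et al. (2021)
B = np.zeros((d_out, r))         # zero init, so training starts from W exactly
x = rng.normal(size=(2, d_in))

# With B = 0 the adapter is a no-op: output matches the frozen layer.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The zero-initialised B matrix is the standard design choice: it guarantees the adapted model is identical to the base model at the start of fine-tuning, while training only the (d_out + d_in) × r adapter parameters instead of the full d_out × d_in matrix.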