Lightweight Neural App Control

Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o.
Researcher Affiliation | Collaboration | Huawei Noah's Ark Lab; Tianjin University; AI Centre, University College London
Pseudocode | No | The paper describes the methodology and architecture in natural language and with diagrams (Figure 1 and Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for its methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Our experiments focus on two open-source mobile phone control datasets, Android Control (Li et al., 2024a) and Android-in-the-Wild (AitW) (Rawles et al., 2023).
Dataset Splits | No | We evaluate on the test set of two datasets, using the same process for all models, with only the observation format and model calling differing. LiMAC is trained on just 13K and 18K episodes for Android Control and AitW, respectively.
Hardware Specification | No | The paper mentions "computational constraints inherent to smartphones" and refers to "modern devices" in the context of model deployment, but it does not specify any hardware details (e.g., GPU/CPU models, memory) used for conducting its experiments or training the models.
Software Dependencies | No | The paper mentions that "AcT is a compact transformer based on GPT-2 architecture", uses "the AdamW optimiser (Loshchilov et al., 2017)", and fine-tunes Florence2 and Qwen2-VL with LoRA adapters (Hu et al., 2021). However, it does not provide specific version numbers for any programming languages, libraries, frameworks, or solvers used in the implementation.
Experiment Setup | Yes | AcT is a compact transformer based on the GPT-2 architecture. The transformer consists of 24 layers and 16 heads per layer. The hidden dimension of the transformer is 1024. We apply a dropout rate of 0.3 (Srivastava et al., 2014) during training across all layers. The AdamW optimiser (Loshchilov et al., 2017) is used in all experiments, with a learning rate of 3×10⁻⁴ specifically for AcT. The functions f_type and f_target are implemented as two-layer fully connected networks, each with a hidden size of 4096 and a dropout rate of 0.3. We use a batch size of 1 with gradient accumulation set to 32. We fine-tune Florence2 for 10 epochs, starting with an initial learning rate of 10⁻⁶, which is gradually reduced to zero during training. The batch size is set to 2, with gradient accumulation configured to 8. For Qwen2-VL, we employ LoRA with a dimensionality of 64, beginning with an initial learning rate of 10⁻⁴, also gradually decreasing to zero throughout training. The batch size for Qwen2-VL is 1, with gradient accumulation similarly set to 8. We fine-tuned Qwen2-VL for 3 epochs.
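The hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. All names here are hypothetical (the paper does not release code), and the linear schedule is one plausible reading of a learning rate "gradually reduced to zero":

```python
# Hypothetical summary of the AcT training setup described in the paper.
ACT_CONFIG = {
    "architecture": "GPT-2-style transformer",
    "layers": 24,
    "heads_per_layer": 16,
    "hidden_dim": 1024,
    "dropout": 0.3,
    "optimiser": "AdamW",
    "learning_rate": 3e-4,
    "batch_size": 1,
    "grad_accumulation": 32,
    "head_hidden_dim": 4096,  # f_type and f_target two-layer MLP heads
}

def effective_batch_size(cfg):
    """Gradients are accumulated, so the effective batch is batch * accumulation steps."""
    return cfg["batch_size"] * cfg["grad_accumulation"]

def linear_decay_lr(initial_lr, step, total_steps):
    """One possible 'gradually reduced to zero' schedule (Florence2: 1e-6, Qwen2-VL: 1e-4)."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)
```

With a batch size of 1 and 32 accumulation steps, each optimiser update for AcT sees an effective batch of 32 episodes' worth of gradients.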
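The LoRA adapters cited in the table (Hu et al., 2021; rank 64 for Qwen2-VL) replace full fine-tuning of a weight matrix with a trainable low-rank update added to the frozen weights. A minimal NumPy sketch of the arithmetic; this is illustrative only and not tied to the paper's (unreleased) code or any particular adapter library:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA forward pass: y = x @ (W + (alpha/r) * B @ A).T,
    where W is frozen and only A and B are trained.
    Shapes: W (d_out, d_in), A (r, d_in), B (d_out, r); r is the adapter rank
    (the paper uses r = 64 for Qwen2-VL)."""
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)  # rank-r update to the frozen weights
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))   # Gaussian init, as in Hu et al. (2021)
B = np.zeros((d_out, r))         # zero init, so training starts from W exactly
x = rng.normal(size=(2, d_in))

# With B = 0 the adapter is a no-op: output matches the frozen layer.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The zero-initialised B matrix is the standard design choice: it guarantees the adapted model is identical to the base model at the start of fine-tuning, while training only the (d_out + d_in) × r adapter parameters instead of the full d_out × d_in matrix.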