Digi-Q: Learning VLM Q-Value Functions for Training Device-Control Agents

Authors: Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The goal of our experiments is to evaluate the efficacy of Digi-Q in producing effective Q-functions that in turn are able to train strong Android device control agents. Our experiments will answer the following questions: (1) How does Digi-Q compare with other state-of-the-art agent training algorithms, previously studied in the context of Android device control tasks? and (2) Can Digi-Q learn effectively from past interaction data? In addition, we perform several ablation experiments to understand the effects of various components of Digi-Q: to understand the benefits of using representation fine-tuning and to validate the efficacy of the Best-of-N reranking approach for training the policy using the value function.
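The Best-of-N reranking idea mentioned above can be sketched in a few lines. This is a hypothetical stand-in, not the paper's implementation: `propose_action` and `q_value` are placeholder names for an action proposer and a learned Q-function, and the toy scalar "actions" are purely illustrative.

```python
import random

def best_of_n_action(q_value, propose_action, state, n=8):
    """Sample N candidate actions from a proposal policy and keep the one
    the learned Q-function scores highest (Best-of-N reranking)."""
    candidates = [propose_action(state) for _ in range(n)]
    return max(candidates, key=lambda a: q_value(state, a))

# Toy illustration: a stand-in "Q-function" that prefers actions near 0.5.
random.seed(0)
toy_q = lambda s, a: -abs(a - 0.5)
toy_proposer = lambda s: random.random()
best = best_of_n_action(toy_q, toy_proposer, state=None, n=16)
```

In this scheme the Q-function never needs to be differentiated through the policy; it only ranks a finite candidate set, which is what makes reranking attractive for large VLM policies.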
Researcher Affiliation | Collaboration | Hao Bai (UIUC), Yifei Zhou (UC Berkeley), Erran Li (Amazon), Sergey Levine (UC Berkeley), Aviral Kumar (CMU). Equal contribution; correspondence to: EMAIL, yifei EMAIL
Pseudocode | Yes | A DETAILS ON THE ALGORITHM: For completeness, we include a detailed pseudo-code of Digi-Q in Algorithm 1. Algorithm 1: Digi-Q: Practical Framework.
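Algorithm 1 itself appears in the paper's appendix; as a rough structural sketch only (a tabular toy stand-in, not the authors' pseudocode, which uses a VLM-based Q-function), the loop alternates TD-style Q-function fitting on offline transitions with greedy policy extraction from the learned Q-function:

```python
from collections import defaultdict

def fit_q_td(transitions, gamma=0.9, lr=0.5, epochs=50, actions=(0, 1)):
    """Fit Q(s, a) by TD(0) regression over a fixed offline dataset
    of (state, action, reward, next_state, done) tuples."""
    q = defaultdict(float)
    for _ in range(epochs):
        for s, a, r, s_next, done in transitions:
            target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += lr * (target - q[(s, a)])
    return q

def extract_policy(q, states, actions=(0, 1)):
    """Greedy policy extraction: pick the action the Q-function ranks highest."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

# Toy offline dataset: from state 0, action 1 reaches a rewarding terminal state.
data = [(0, 1, 1.0, 1, True), (0, 0, 0.0, 1, True)]
policy = extract_policy(fit_q_td(data), states=[0])
# policy[0] == 1
```

The key property mirrored here is that both stages consume only previously collected transitions, which is what lets Digi-Q learn from past interaction data without fresh environment rollouts during Q-function training.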
Open Source Code | Yes | The project is open-sourced at https://github.com/DigiRL-agent/digiq
Open Datasets | Yes | We evaluate our results on Android-in-the-Wild (AitW) with an offline dataset containing 1296 trajectories for the AitW Web Shopping subset and 1008 trajectories for the AitW General subset, collected from a pre-trained AutoUI checkpoint, following Bai et al. (2024).
Dataset Splits | No | The paper mentions that it evaluates results with the autonomous evaluator on the first 96 instructions in the train and test sets, and collects 1296 trajectories for the Web Shopping subset and 1008 for the General subset. However, it does not explicitly provide the percentages or counts of the training, validation, and test splits used for model training from these trajectories.
Hardware Specification | No | We thank Google Cloud for providing Gemini 1.5 Pro credit donations for academic use and some GPU and TPU resources. We also thank the NCSA Delta cluster admins for providing us with GPU resources for training.
Software Dependencies | No | We encode the text strings with BERT and images with a BLIP-2 model. Then we concatenate all these feature vectors and pass them through an MLP that predicts the V value. We use LLaVA-1.5 (Liu et al., 2024a) as the backbone VLM for our Q- and V-functions.
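The described value head (concatenated text and image embeddings fed to an MLP that regresses a scalar V value) can be sketched as follows. This is a minimal NumPy illustration under assumptions: the embedding dimensions, hidden width, and initialization are hypothetical, and in practice the features would come from actual BERT and BLIP-2 encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
TEXT_DIM, IMG_DIM, HIDDEN = 768, 1408, 256  # illustrative sizes only

# Randomly initialized two-layer MLP weights (stand-in for a trained head).
W1 = rng.standard_normal((TEXT_DIM + IMG_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, 1)) * 0.01
b2 = np.zeros(1)

def v_value(text_emb, img_emb):
    """Concatenate per-modality feature vectors and predict a scalar V value."""
    x = np.concatenate([text_emb, img_emb])
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return float(h @ W2 + b2)

v = v_value(rng.standard_normal(TEXT_DIM), rng.standard_normal(IMG_DIM))
```

Freezing the encoders and training only a small head like this keeps value regression cheap relative to fine-tuning the full VLM backbone.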
Experiment Setup | Yes | Hyperparameters for Digi-Q are carefully tuned through binary search on the training sets of the General and Web Shopping subsets. The final choice of hyperparameters for both methods can be found in Table 4. ... Table 4: Hyperparameters for Digi-Q on both the General and Web Shopping subsets of AitW. If multiple values are displayed, the bolded value represents the selected value after hyperparameter sweeping.