QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing
Authors: Grace Zhang, Ayush Jain, Injune Hwang, Shao-Hua Sun, Joseph Lim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that QMP's behavioral policy sharing provides complementary gains over many popular MTRL algorithms and outperforms alternative ways to share behaviors in various manipulation, locomotion, and navigation environments. Videos are available at https://qmp-mtrl.github.io/. We evaluate our method in 7 multi-task designs in manipulation, navigation, and locomotion environments, shown in Figure 6. |
| Researcher Affiliation | Academia | Grace Zhang1 Ayush Jain1 Injune Hwang2 Shao-Hua Sun3 Joseph J. Lim2 1University of Southern California 2KAIST 3National Taiwan University |
| Pseudocode | Yes | Algorithm 1: Q-switch Mixture of Policies (QMP)<br>Input: task set {T1, ..., TN}<br>Initialize policies {πi}, Q-functions {Qi}, and data buffers {Di} for i = 1..N<br>for each epoch do<br>&nbsp;&nbsp;for i = 1 to N do<br>&nbsp;&nbsp;&nbsp;&nbsp;while task Ti episode not terminated do<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe state s<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute πmix_i as in Eq. 3<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sample action proposal a ~ πmix_i<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Di ← Di ∪ {(s, a, ri, s′)}<br>&nbsp;&nbsp;&nbsp;&nbsp;end while<br>&nbsp;&nbsp;end for<br>&nbsp;&nbsp;for i = 1 to N do<br>&nbsp;&nbsp;&nbsp;&nbsp;Update πi, Qi using Di with SAC<br>&nbsp;&nbsp;end for<br>end for<br>Output: trained policies {πi} for i = 1..N |
| Open Source Code | Yes | Our code is available here https://github.com/clvrai/qmp. |
| Open Datasets | Yes | We implement multistage reacher tasks on the OpenAI Gym (Brockman et al., 2016) Reacher environment simulated in the MuJoCo physics engine (Todorov et al., 2012) ... The layout and dynamics of the maze follow Fu et al. (2020)... For Meta-World CDS, we reproduce the Meta-World environment proposed by Yu et al. (2021) using the Meta-World codebase (Yu et al., 2019). |
| Dataset Splits | No | The paper describes the setup for various multi-task environments (Multistage Reacher, Maze Navigation, Meta-World, Walker2D, Kitchen) and defines the tasks within them. It mentions '10 episodes per task, and 5 seeds' for evaluation metrics, but does not provide specific training/validation/test splits for any dataset, which is common in reinforcement learning where data is generated through interaction rather than from pre-split static datasets. |
| Hardware Specification | Yes | We run the experiments primarily on machines with either NVIDIA GeForce RTX 2080 Ti or RTX 3090. |
| Software Dependencies | No | We used PyTorch (Paszke et al., 2019) for our implementation... We use the Weights & Biases tool (Biewald, 2020) for logging and tracking experiments. All the environments were developed using the OpenAI Gym interface (Brockman et al., 2016). The paper cites these tools with years but does not provide specific version numbers for the software (e.g., 'PyTorch 1.9' or 'Gym 0.26'), only the citation year of a related paper/release. |
| Experiment Setup | Yes | H.1 HYPERPARAMETERS: Table 5 details the list of important hyperparameters on all 3 environments. For all environments, we used a 2-layer fully connected network with hidden dimension 256 and a tanh activation function for the policies and Q-functions. We use a target network for the Q-function with target update τ = 0.995, and train with an RL discount of γ = 0.99. Table 5 (Hyperparameters) covers: minimum buffer size (per task), number of environment steps per update (per task), number of gradient steps per update (per task), batch size, and learning rates for π, Q, and α. |
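The behavior-sharing step of Algorithm 1 quoted above (each task's own critic Q_i arbitrating among action proposals from all tasks' policies, per Eq. 3) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `policy_means`, `propose_action`, and `q_value` are hypothetical stand-ins for the learned SAC policies and Q-functions, and the real method samples from a mixture rather than this simplified hard arg-max switch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's networks): each task's policy
# proposes a Gaussian action around a task-specific mean, and each task's
# critic scores (state, action) pairs.
N_TASKS, ACT_DIM = 3, 2
policy_means = rng.normal(size=(N_TASKS, ACT_DIM))

def propose_action(task_j, state):
    """Sample an action proposal a_j ~ pi_j(. | s) (toy Gaussian policy)."""
    return policy_means[task_j] + 0.1 * rng.normal(size=ACT_DIM)

def q_value(task_i, state, action):
    """Toy critic Q_i(s, a): higher for actions near task i's policy mean."""
    return -float(np.sum((action - policy_means[task_i]) ** 2))

def qmp_action(task_i, state):
    """Q-switch mixture for task i: collect one proposal per task and let
    task i's own critic Q_i pick the highest-scoring one. Useful proposals
    from other tasks' policies get selected; harmful ones are filtered out."""
    proposals = [propose_action(j, state) for j in range(N_TASKS)]
    scores = [q_value(task_i, state, a) for a in proposals]
    return proposals[int(np.argmax(scores))]

state = np.zeros(4)  # placeholder state
action = qmp_action(task_i=0, state=state)
```

In the full algorithm this selection only shapes data collection; each policy π_i is still trained on its own buffer D_i with standard SAC, which is why the sharing is complementary to other MTRL methods rather than replacing their training losses.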