QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing
Authors: Grace Zhang, Ayush Jain, Injune Hwang, Shao-Hua Sun, Joseph Lim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that QMP's behavioral policy sharing provides complementary gains over many popular MTRL algorithms and outperforms alternative ways to share behaviors in various manipulation, locomotion, and navigation environments. Videos are available at https://qmp-mtrl.github.io/. We evaluate our method in 7 multi-task designs in manipulation, navigation, and locomotion environments, shown in Figure 6. |
| Researcher Affiliation | Academia | Grace Zhang1 Ayush Jain1 Injune Hwang2 Shao-Hua Sun3 Joseph J. Lim2 1University of Southern California 2KAIST 3National Taiwan University |
| Pseudocode | Yes | Algorithm 1: Q-switch Mixture of Policies (QMP)<br>Input: task set {T1, ..., TN}<br>Initialize policies {πi}, Q-functions {Qi}, and data buffers {Di} for i = 1..N<br>for each epoch do<br>&nbsp;&nbsp;for i = 1 to N do<br>&nbsp;&nbsp;&nbsp;&nbsp;while task Ti episode not terminated do<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe state s<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute πmix_i as in Eq. 3<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sample action proposal a ~ πmix_i<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Di ← Di ∪ {(s, a, ri, s′)}<br>&nbsp;&nbsp;&nbsp;&nbsp;end while<br>&nbsp;&nbsp;end for<br>&nbsp;&nbsp;for i = 1 to N do<br>&nbsp;&nbsp;&nbsp;&nbsp;Update πi, Qi using Di with SAC<br>&nbsp;&nbsp;end for<br>end for<br>Output: trained policies {πi} for i = 1..N |
| Open Source Code | Yes | Our code is available here https://github.com/clvrai/qmp. |
| Open Datasets | Yes | We implement multistage reacher tasks on the OpenAI Gym (Brockman et al., 2016) Reacher environment simulated in the MuJoCo physics engine (Todorov et al., 2012) ... The layout and dynamics of the maze follow Fu et al. (2020)... For Meta-World CDS, we reproduce the Meta-World environment proposed by Yu et al. (2021) using the Meta-World codebase (Yu et al., 2019). |
| Dataset Splits | No | The paper describes the setup for various multi-task environments (Multistage Reacher, Maze Navigation, Meta-World, Walker2D, Kitchen) and defines the tasks within them. It mentions '10 episodes per task, and 5 seeds' for evaluation metrics, but does not provide specific training/validation/test splits for any dataset, which is common in reinforcement learning where data is generated through interaction rather than from pre-split static datasets. |
| Hardware Specification | Yes | We run the experiments primarily on machines with either NVIDIA GeForce RTX 2080 Ti or RTX 3090. |
| Software Dependencies | No | We used PyTorch (Paszke et al., 2019) for our implementation... We use the Weights & Biases tool (Biewald, 2020) for logging and tracking experiments. All the environments were developed using the OpenAI Gym interface (Brockman et al., 2016). The paper cites these tools with years but does not provide specific version numbers for the software (e.g., 'PyTorch 1.9' or 'Gym 0.26'), only the citation year of a related paper/release. |
| Experiment Setup | Yes | H.1 HYPERPARAMETERS: Table 5 details the list of important hyperparameters on all 3 environments. For all environments, we used a 2-layer fully connected network with hidden dimension 256 and a tanh activation function for the policies and Q-functions. We use a target network for the Q-function with target update τ = 0.995, and train with an RL discount of γ = 0.99. Table 5 (Hyperparameters) covers: minimum buffer size (per task), number of environment steps per update (per task), number of gradient steps per update (per task), batch size, and learning rates for π, Q, and α. |
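The behavior-sharing step of Algorithm 1 quoted above (each task's own critic Q_i arbitrating among action proposals from all tasks' policies, per Eq. 3) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `policy_means`, `propose_action`, and `q_value` are hypothetical stand-ins for the learned SAC policies and Q-functions, and the real method samples from a mixture rather than this simplified hard arg-max switch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's networks): each task's policy
# proposes a Gaussian action around a task-specific mean, and each task's
# critic scores (state, action) pairs.
N_TASKS, ACT_DIM = 3, 2
policy_means = rng.normal(size=(N_TASKS, ACT_DIM))

def propose_action(task_j, state):
    """Sample an action proposal a_j ~ pi_j(. | s) (toy Gaussian policy)."""
    return policy_means[task_j] + 0.1 * rng.normal(size=ACT_DIM)

def q_value(task_i, state, action):
    """Toy critic Q_i(s, a): higher for actions near task i's policy mean."""
    return -float(np.sum((action - policy_means[task_i]) ** 2))

def qmp_action(task_i, state):
    """Q-switch mixture for task i: collect one proposal per task and let
    task i's own critic Q_i pick the highest-scoring one. Useful proposals
    from other tasks' policies get selected; harmful ones are filtered out."""
    proposals = [propose_action(j, state) for j in range(N_TASKS)]
    scores = [q_value(task_i, state, a) for a in proposals]
    return proposals[int(np.argmax(scores))]

state = np.zeros(4)  # placeholder state
action = qmp_action(task_i=0, state=state)
```

In the full algorithm this selection only shapes data collection; each policy π_i is still trained on its own buffer D_i with standard SAC, which is why the sharing is complementary to other MTRL methods rather than replacing their training losses.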