Wasserstein Policy Optimization

Authors: David Pfau, Ian Davies, Diana L. Borsa, João Guilherme Madeira Araújo, Brendan Daniel Tracey, Hado van Hasselt

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods. An open-source implementation of WPO is available in Acme (Hoffman et al., 2020). We compare its performance against several baseline methods on the DeepMind Control Suite (Tassa et al., 2018; Tunyasuvunakool et al., 2020) and a task controlling magnetic coils in a simulated magnetic confinement fusion device (Tracey et al., 2024).
Researcher Affiliation | Industry | Google DeepMind, London, UK. Correspondence to: David Pfau <EMAIL>.
Pseudocode | Yes | Algorithm 1: WPO with Replay and n-step TD critic learning for a multi-dimensional Gaussian Policy
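Algorithm 1's critic update uses n-step TD targets. As a point of reference, a minimal sketch of that standard target computation is below; this is not the paper's implementation, and the function name and signature are illustrative.

```python
def n_step_td_target(rewards, bootstrap_value, discount=0.99, n=5):
    """Standard n-step TD target:
    G_t = r_t + g*r_{t+1} + ... + g^(n-1)*r_{t+n-1} + g^n * V(s_{t+n}).

    `rewards` holds r_t ... r_{t+n-1}; `bootstrap_value` is the critic's
    estimate V(s_{t+n}) at the state reached after n steps.
    """
    target = bootstrap_value
    # Fold rewards in backwards so each step applies one discount factor.
    for r in reversed(rewards[:n]):
        target = r + discount * target
    return target
```

Truncating `rewards` to the first `n` entries keeps the target consistent when a longer trajectory slice is passed in.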
Open Source Code | Yes | An open-source implementation of WPO is available in Acme (Hoffman et al., 2020): https://github.com/google-deepmind/acme. Note that the implementation in Acme is not the version used for the experiments in this paper. However, we have run this implementation on DeepMind Control Suite tasks and found qualitatively similar performance to that reported here.
Open Datasets | Yes | To evaluate the effectiveness of WPO, we evaluate it on the DeepMind Control Suite (Tassa et al., 2018; Tunyasuvunakool et al., 2020), a set of tasks in MuJoCo (Todorov et al., 2012). These tasks vary from one-dimensional actions, like swinging a pendulum, up to a 56-DoF humanoid. We additionally consider magnetic control of a tokamak plasma in simulation, a problem originally tackled by MPO in Degrave et al. (2022).
Dataset Splits | No | The paper describes using a replay buffer and sampling mini-batches for training, and mentions running experiments on tasks from the DeepMind Control Suite. However, it does not specify train/validation/test splits (percentages, sample counts, or methodology); data is generated online, as is standard in RL.
Hardware Specification | No | Our training setup is similar to other distributed RL systems (Hoffman et al., 2020): we run 4 actors in parallel to generate training data for the Control Suite tasks, and 1000 actors for the tokamak task. The paper does not specify the CPUs, GPUs, or other hardware components used.
Software Dependencies | No | The paper mentions several software components, including Acme (Hoffman et al., 2020), MuJoCo (Todorov et al., 2012), and the FGE Grad-Shafranov simulator (Carpanese, 2021). However, it does not give version numbers for these or for other critical dependencies such as Python, TensorFlow, or PyTorch, which would be needed for exact reproduction.
Experiment Setup | Yes | Our training setup is similar to other distributed RL systems (Hoffman et al., 2020): we run 4 actors in parallel to generate training data for the Control Suite tasks, and 1000 actors for the tokamak task. Training hyperparameters are listed in Sec. B and an outline of the full training loop is given in Alg. 1 in the appendix. For each RL algorithm, the same hyperparameters were used for every Control Suite environment. Section B provides detailed hyperparameters in Table 1 (common hyperparameters), Table 2 (WPO), Table 3 (DDPG), Table 4 (MPO), and Table 5 (SAC).
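The replay-and-minibatch setup described above is the standard pattern in distributed actor-learner systems such as Acme: actors append transitions to a shared buffer while the learner samples minibatches from it. A minimal single-process sketch (illustrative only, not the Acme implementation) could look like:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO replay buffer.

    Actors call `add` with transitions; the learner calls `sample` to
    draw uniform minibatches. When full, the oldest transitions are
    evicted automatically by the deque.
    """

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement within a minibatch.
        return random.sample(list(self.storage), batch_size)
```

In a real distributed run the buffer would live in a separate server process (Acme uses Reverb for this role), but the sampling semantics are the same.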