Wasserstein Policy Optimization

Authors: David Pfau, Ian Davies, Diana L. Borsa, João Guilherme Madeira Araújo, Brendan Daniel Tracey, Hado van Hasselt

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods. An open-source implementation of WPO is available in Acme (Hoffman et al., 2020). We compare its performance against several baseline methods on the DeepMind Control Suite (Tassa et al., 2018; Tunyasuvunakool et al., 2020) and a task controlling magnetic coils in a simulated magnetic confinement fusion device (Tracey et al., 2024).
Researcher Affiliation | Industry | Google DeepMind, London, UK. Correspondence to: David Pfau <EMAIL>.
Pseudocode | Yes | Algorithm 1: WPO with Replay and n-step TD critic learning for a multi-dimensional Gaussian Policy
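Algorithm 1's critic update uses n-step TD targets. As a point of reference, a minimal sketch of that standard target computation is below; this is not the paper's implementation, and the function name and signature are illustrative.

```python
def n_step_td_target(rewards, bootstrap_value, discount=0.99, n=5):
    """Standard n-step TD target:
    G_t = r_t + g*r_{t+1} + ... + g^(n-1)*r_{t+n-1} + g^n * V(s_{t+n}).

    `rewards` holds r_t ... r_{t+n-1}; `bootstrap_value` is the critic's
    estimate V(s_{t+n}) at the state reached after n steps.
    """
    target = bootstrap_value
    # Fold rewards in backwards so each step applies one discount factor.
    for r in reversed(rewards[:n]):
        target = r + discount * target
    return target
```

Truncating `rewards` to the first `n` entries keeps the target consistent when a longer trajectory slice is passed in.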
Open Source Code | Yes | An open-source implementation of WPO is available in Acme (Hoffman et al., 2020): https://github.com/google-deepmind/acme. Note that the implementation in Acme is not the version used for the experiments in this paper. However, we have run this implementation on DeepMind Control Suite tasks and found qualitatively similar performance to that reported here.
Open Datasets | Yes | To evaluate the effectiveness of WPO, we evaluate it on the DeepMind Control Suite (Tassa et al., 2018; Tunyasuvunakool et al., 2020), a set of tasks in MuJoCo (Todorov et al., 2012). These tasks vary from one-dimensional actions, like swinging a pendulum, up to a 56-DoF humanoid. We additionally consider magnetic control of a tokamak plasma in simulation, a problem originally tackled by MPO in Degrave et al. (2022).
Dataset Splits | No | The paper describes using a replay buffer and sampling mini-batches for training, and mentions running experiments on tasks from the DeepMind Control Suite. However, it does not specify train/validation/test splits (percentages, sample counts, or methodology); data is generated online, as is standard in RL.
Hardware Specification | No | Our training setup is similar to other distributed RL systems (Hoffman et al., 2020): we run 4 actors in parallel to generate training data for the Control Suite tasks, and 1000 actors for the tokamak task. The paper does not specify the CPUs, GPUs, or other hardware components used.
Software Dependencies | No | The paper mentions several software components, including Acme (Hoffman et al., 2020), MuJoCo (Todorov et al., 2012), and the FGE Grad-Shafranov simulator (Carpanese, 2021). However, it does not give version numbers for these or for other critical dependencies such as Python, TensorFlow, or PyTorch, which would be needed for exact reproduction.
Experiment Setup | Yes | Our training setup is similar to other distributed RL systems (Hoffman et al., 2020): we run 4 actors in parallel to generate training data for the Control Suite tasks, and 1000 actors for the tokamak task. Training hyperparameters are listed in Sec. B and an outline of the full training loop is given in Alg. 1 in the appendix. For each RL algorithm, the same hyperparameters were used for every Control Suite environment. Section B provides detailed hyperparameters in Table 1 (common hyperparameters), Table 2 (WPO), Table 3 (DDPG), Table 4 (MPO), and Table 5 (SAC).
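The replay-and-minibatch setup described above is the standard pattern in distributed actor-learner systems such as Acme: actors append transitions to a shared buffer while the learner samples minibatches from it. A minimal single-process sketch (illustrative only, not the Acme implementation) could look like:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO replay buffer.

    Actors call `add` with transitions; the learner calls `sample` to
    draw uniform minibatches. When full, the oldest transitions are
    evicted automatically by the deque.
    """

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement within a minibatch.
        return random.sample(list(self.storage), batch_size)
```

In a real distributed run the buffer would live in a separate server process (Acme uses Reverb for this role), but the sampling semantics are the same.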