Policy Gradient with Kernel Quadrature

Authors: Satoshi Hayakawa, Tetsuro Morimura

TMLR 2024

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo tasks. ... To demonstrate the effectiveness of our proposed methods, we conducted experiments on MuJoCo tasks since they are widely recognized as standard benchmarks in RL, even though the reward calculation for them is lightweight.
Researcher Affiliation | Collaboration | Satoshi Hayakawa (EMAIL), Mathematical Institute, University of Oxford; Tetsuro Morimura (EMAIL), CyberAgent, Inc.
Pseudocode | Yes | Algorithm 1: Policy gradient; Algorithm 2: Vanilla PGKQ; Algorithm 3: PGKQ with non-centered GP
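For orientation, Algorithm 1 (the policy gradient) is the standard score-function (REINFORCE-style) estimator, not anything specific to PGKQ. A minimal PyTorch sketch of the per-episode surrogate loss, using the paper's discount rate γ = 0.995 as the default and with everything else illustrative, might look like:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.995):
    """Surrogate loss whose gradient is the score-function policy gradient.

    log_probs: log pi(a_t | s_t) for one episode, shape (T,)
    rewards:   per-step rewards, shape (T,)
    gamma:     discount rate (0.995 in the paper's setup)
    """
    T = len(rewards)
    returns = torch.zeros(T)
    g = 0.0
    # Discounted return-to-go: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g
        returns[t] = g
    # Minimizing this loss ascends E[ sum_t G_t * grad log pi(a_t | s_t) ]
    return -(log_probs * returns).sum()
```

In practice a learned baseline is subtracted from the returns to reduce variance, which matches the paper's mention of separate policy and baseline networks.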
Open Source Code | No | The paper mentions using a third-party library (the machina3 library, with a GitHub link) and the implementation of specific components (mψ and kψ), but it does not provide a direct link to, or an explicit statement about the availability of, the authors' own source code for the PGKQ methodology described.
Open Datasets | Yes | We used MuJoCo (v2.1.0, Todorov et al., 2012) with the Gymnasium API (Towers et al., 2023).
Dataset Splits | No | The paper mentions batch sizes (N = 64 and n = 8 episodes) and a maximum episode length of 1000, but it does not give details on how data were split into training, validation, or test sets for experimental reproduction.
Hardware Specification | Yes | All the experiments with MuJoCo were conducted with a Google Cloud Vertex AI notebook with an NVIDIA T4 (16-core vCPU, 60 GB RAM).
Software Dependencies | No | All the experiments were conducted by using PyTorch (Paszke et al., 2019) and Adam (Kingma & Ba, 2015). ... We used the implementation of the machina3 library.
Experiment Setup | Yes | The learning rates of the policy, baseline, and GP-related networks were all set to 3 × 10⁻⁴. ... The discount rate was γ = 0.995. ... In all the experiments, we used three-layer fully connected ReLU neural networks (NNs) for each of mψ and kψ, where kψ(z, z′) was computed by passing the NN-embeddings of state-action pairs z and z′ to the Gaussian kernel with additional scale and noise parameters.
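The kernel construction quoted above can be sketched in PyTorch. This is a hedged reconstruction, not the authors' code: the layer widths, embedding dimension, and parameter initializations are assumptions; only the overall shape (a three-layer ReLU embedding network feeding a Gaussian kernel with learnable scale and noise parameters) follows the quoted description.

```python
import torch
import torch.nn as nn

class EmbeddingGaussianKernel(nn.Module):
    """Gaussian kernel on NN-embeddings of state-action pairs z.

    Sketch of kψ as described: a three-layer fully connected ReLU
    network embeds each z, and the kernel is a Gaussian (RBF) on the
    embeddings with learnable scale and noise parameters. Widths and
    initializations here are illustrative assumptions.
    """
    def __init__(self, in_dim, hidden=64, emb_dim=32):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )
        self.log_scale = nn.Parameter(torch.zeros(()))    # output scale
        self.log_noise = nn.Parameter(torch.tensor(-4.0)) # diagonal noise

    def forward(self, z1, z2):
        e1, e2 = self.embed(z1), self.embed(z2)
        sq_dist = torch.cdist(e1, e2).pow(2)
        k = torch.exp(self.log_scale) * torch.exp(-0.5 * sq_dist)
        if z1 is z2:
            # Add noise on the diagonal when evaluating the Gram matrix
            k = k + torch.exp(self.log_noise) * torch.eye(len(z1))
        return k
```

The mean function mψ would be a separate three-layer ReLU network of the same flavor, mapping z directly to a scalar.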