Policy Gradient with Kernel Quadrature
Authors: Satoshi Hayakawa, Tetsuro Morimura
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo tasks. ... To demonstrate the effectiveness of our proposed methods, we conducted experiments on MuJoCo tasks since they are widely recognized as standard benchmarks in RL, even though the reward calculation for them is lightweight. |
| Researcher Affiliation | Collaboration | Satoshi Hayakawa EMAIL Mathematical Institute, University of Oxford Tetsuro Morimura EMAIL CyberAgent, Inc. |
| Pseudocode | Yes | Algorithm 1 Policy gradient Algorithm 2 Vanilla PGKQ Algorithm 3 PGKQ with non-centered GP |
| Open Source Code | No | The paper mentions using a third-party library ('machina3 library' with a GitHub link) and the implementation of specific components ('mψ and kψ') but does not provide a direct link or explicit statement about the availability of the authors' own source code for the PGKQ methodology described. |
| Open Datasets | Yes | We used MuJoCo (v2.1.0, Todorov et al., 2012) with the Gymnasium API (Towers et al., 2023). |
| Dataset Splits | No | The paper reports batch sizes (N = 64 and n = 8 episodes) and a maximum episode length (1000) but, as is typical for online RL, does not describe training/validation/test splits of a fixed dataset. |
| Hardware Specification | Yes | All the experiments with MuJoCo were conducted with a Google Cloud Vertex AI notebook with an NVIDIA T4 (16-core vCPU, 60 GB RAM). |
| Software Dependencies | No | All the experiments were conducted by using PyTorch (Paszke et al., 2019) and Adam (Kingma & Ba, 2015). ... We used the implementation of the machina3 library. |
| Experiment Setup | Yes | The learning rates of the policy, baseline, and GP-related networks were all set to 3 × 10−4. ... The discount rate was γ = 0.995. ... In all the experiments, we used three-layer fully connected ReLU neural networks (NNs) for each of mψ and kψ, where kψ(z, z′) was computed by passing the NN-embeddings of state-action pairs z and z′ to the Gaussian kernel with additional scale and noise parameters. |
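The kernel construction quoted in the Experiment Setup row can be sketched as follows. The paper's own implementation uses PyTorch and is not released, so this is a framework-agnostic NumPy sketch under assumptions: layer widths, embedding dimension, and initial scale/noise values are illustrative, not the authors' settings. It shows the stated structure — a three-layer fully connected ReLU network embeds state-action pairs z, and kψ(z, z′) is a Gaussian kernel on those embeddings with scale and noise parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden=64, emb_dim=16):
    """Random weights for a three-layer fully connected ReLU embedding
    network (hidden widths are assumptions, not from the paper)."""
    dims = [in_dim, hidden, hidden, emb_dim]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def embed(params, z):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    h = z
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

def gaussian_kernel(params, z1, z2, log_scale=0.0, log_noise=-2.0):
    """k(z, z') = scale * exp(-||e(z) - e(z')||^2 / 2) on NN embeddings,
    with diagonal noise added when evaluating a Gram matrix on one batch."""
    e1, e2 = embed(params, z1), embed(params, z2)
    sq = ((e1[:, None, :] - e2[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    K = np.exp(log_scale) * np.exp(-0.5 * sq)
    if z1 is z2:
        K = K + np.exp(log_noise) * np.eye(len(z1))
    return K
```

For example, evaluating the kernel on a batch of 8 state-action vectors yields a symmetric, positive-definite 8×8 Gram matrix, which is what a kernel-quadrature step would operate on.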