Bayesian Policy Gradient and Actor-Critic Algorithms
Authors: Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems. [...] In this section, we compare the Bayesian quadrature (BQ) and the plain MC gradient estimates on a simple bandit problem as well as on a continuous state and action linear quadratic regulator (LQR). We also evaluate the performance of the Bayesian policy gradient (BPG) algorithm described in Algorithm 2 on the LQR, and compare it with a Monte-Carlo based policy gradient (MCPG) algorithm. [...] In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton and Barto, 1998) and ship steering problem (Miller et al., 1990). |
| Researcher Affiliation | Collaboration | Mohammad Ghavamzadeh, Adobe Research & INRIA; Yaakov Engel, Rafael Advanced Defence System, Israel; Michal Valko, INRIA Lille, SequeL team, France |
| Pseudocode | Yes | Algorithm 1 A Bayesian Policy Gradient Evaluation Algorithm Algorithm 2 A Bayesian Policy Gradient Algorithm Algorithm 3 Fisher Information Matrix Estimation Algorithm Algorithm 4 A Bayesian Actor-Critic Algorithm |
| Open Source Code | Yes | 17. The code for all the experiments of this section is available at https://sequel.lille.inria.fr/Software/BAC. |
| Open Datasets | Yes | We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems. [...] In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton and Barto, 1998) and ship steering problem (Miller et al., 1990). |
| Dataset Splits | No | The paper describes generating 'M' sample paths or episodes for gradient estimation and averaging results over multiple runs (e.g., 10^4 runs, 10^3 runs, 100 independent learning trials). While this indicates how data is sampled and evaluated within a reinforcement learning context, it does not specify fixed training, validation, or test dataset splits in the traditional sense, as the environments are simulated and episodes are generated dynamically. |
| Hardware Specification | No | Part of the computational experiments was conducted using the Grid 5000 experimental testbed (https://www.grid5000.fr). While this indicates a computing resource was used, it does not provide specific hardware details such as GPU models, CPU types, or memory amounts. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or library versions used for implementation (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | We use Algorithm 2 with the number of updates set to N = 100, and Model 1 with sparsification for the BPG and BPNG methods. [...] The policy parameters are initialized randomly at each run. In order to ensure that the learned parameters do not exceed an acceptable range, the policy parameters are defined as λ = 1.999 + 1.998/(1 + e^(κ1)) and σ = 0.001 + 1/(1 + e^(κ2)). [...] We used two different learning rates for the two components of the gradient. For a fixed sample size, BPG and MCPG methods start with an initial learning rate and decrease it according to the schedule β_j = β0 · 20/(20 + j). The BPNG algorithm uses a fixed learning rate multiplied by the determinant of the Fisher information matrix. We tried many values for the initial learning rates used by these algorithms and those in Table 3 yielded the best performance of those we tried. |
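The parameter mapping and learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the two formulas only, not the authors' implementation; the initial rate `beta0` passed in is a placeholder for the tuned values listed in the paper's Table 3.

```python
import math

def policy_params(kappa1, kappa2):
    """Map unconstrained parameters (kappa1, kappa2) to the bounded
    policy parameters, per the formulas quoted from the paper:
    lambda = 1.999 + 1.998/(1 + e^kappa1), sigma = 0.001 + 1/(1 + e^kappa2).
    The sigmoid keeps lambda in (1.999, 3.997) and sigma in (0.001, 1.001)."""
    lam = 1.999 + 1.998 / (1.0 + math.exp(kappa1))
    sigma = 0.001 + 1.0 / (1.0 + math.exp(kappa2))
    return lam, sigma

def learning_rate(beta0, j):
    """Decay schedule beta_j = beta0 * 20/(20 + j) used by BPG and MCPG
    at update j; beta0 is the (tuned) initial learning rate."""
    return beta0 * 20.0 / (20.0 + j)
```

For example, at update j = 20 the schedule has halved the initial rate, and as j grows the rate decays toward zero, which matches the qualitative description of a decreasing step size over the N = 100 updates.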