Bayesian Policy Gradient and Actor-Critic Algorithms

Authors: Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko

JMLR 2016

Each entry below lists a reproducibility variable, its result, and the LLM response given as supporting evidence.
Research Type: Experimental. "We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems. [...] In this section, we compare the Bayesian quadrature (BQ) and the plain MC gradient estimates on a simple bandit problem as well as on a continuous state and action linear quadratic regulator (LQR). We also evaluate the performance of the Bayesian policy gradient (BPG) algorithm described in Algorithm 2 on the LQR, and compare it with a Monte-Carlo based policy gradient (MCPG) algorithm. [...] In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton and Barto, 1998) and ship steering problem (Miller et al., 1990)."
Researcher Affiliation: Collaboration. Mohammad Ghavamzadeh (EMAIL), Adobe Research & INRIA; Yaakov Engel (EMAIL), Rafael Advanced Defence System, Israel; Michal Valko (EMAIL), INRIA Lille, SequeL team, France.
Pseudocode: Yes. Algorithm 1: A Bayesian Policy Gradient Evaluation Algorithm; Algorithm 2: A Bayesian Policy Gradient Algorithm; Algorithm 3: Fisher Information Matrix Estimation Algorithm; Algorithm 4: A Bayesian Actor-Critic Algorithm.
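For context, the Monte-Carlo policy gradient (MCPG) baseline that these Bayesian algorithms are compared against can be illustrated on the bandit setting mentioned in the experiments. The sketch below is hypothetical and not the paper's implementation: it uses a one-parameter Gaussian policy, a score-function (REINFORCE-style) gradient estimate averaged over M sampled actions, and the decaying step size β_j = β_0 · 20/(20 + j) quoted in the experiment setup; the reward function and all constants are ours.

```python
import random

def mc_policy_gradient_bandit(theta0=0.0, n_updates=100, m_paths=100,
                              beta0=0.1, seed=0):
    """Toy Monte-Carlo policy gradient (REINFORCE) on a 1-armed Gaussian bandit.

    Policy: a ~ N(theta, 1); reward: r = -(a - 2)^2, so the optimum is theta = 2.
    This shows only the generic MC loop whose gradient estimate the Bayesian
    quadrature approach replaces; it is not the paper's code.
    """
    rng = random.Random(seed)
    theta = theta0
    for j in range(n_updates):
        # Score-function gradient estimate from M sampled actions ("paths").
        grad = 0.0
        for _ in range(m_paths):
            a = rng.gauss(theta, 1.0)
            r = -(a - 2.0) ** 2
            grad += r * (a - theta)  # d/dtheta log N(a; theta, 1) = (a - theta)
        grad /= m_paths
        beta = beta0 * 20.0 / (20.0 + j)  # decaying step size from the paper
        theta += beta * grad
    return theta
```

With the defaults above, the iterate drifts toward the optimal mean action of 2; the Bayesian quadrature estimator in Algorithm 2 targets exactly this gradient but with far fewer sampled paths.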
Open Source Code: Yes. "17. The code for all the experiments of this section is available at https://sequel.lille.inria.fr/Software/BAC."
Open Datasets: Yes. "We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems. [...] In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton and Barto, 1998) and ship steering problem (Miller et al., 1990)."
Dataset Splits: No. The paper describes generating M sample paths or episodes for gradient estimation and averaging results over multiple runs (e.g., 10^4 runs, 10^3 runs, 100 independent learning trials). While this indicates how data is sampled and evaluated within a reinforcement learning context, it does not specify fixed training, validation, or test dataset splits in the traditional sense, as the environments are simulated and episodes are generated dynamically.
Hardware Specification: No. Part of the computational experiments was conducted using the Grid'5000 experimental testbed (https://www.grid5000.fr). While this indicates a computing resource was used, it does not provide specific hardware details such as GPU models, CPU types, or memory amounts.
Software Dependencies: No. The paper does not explicitly mention specific software dependencies or library versions used for the implementation (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup: Yes. "We use Algorithm 2 with the number of updates set to N = 100, and Model 1 with sparsification for the BPG and BPNG methods. [...] The policy parameters are initialized randomly at each run. In order to ensure that the learned parameters do not exceed an acceptable range, the policy parameters are defined as λ = 1.999 + 1.998/(1 + e^κ1) and σ = 0.001 + 1/(1 + e^κ2). [...] We used two different learning rates for the two components of the gradient. For a fixed sample size, BPG and MCPG methods start with an initial learning rate and decrease it according to the schedule β_j = β_0 · 20/(20 + j). The BPNG algorithm uses a fixed learning rate multiplied by the determinant of the Fisher information matrix. We tried many values for the initial learning rates used by these algorithms, and those in Table 3 yielded the best performance of those we tried."
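The quoted parameter squashing and step-size schedule can be written out directly. The sketch below only transcribes those formulas; the function names are ours, not the paper's, and the ranges noted in the docstring follow from the formulas exactly as quoted.

```python
import math

def policy_params(kappa1, kappa2):
    """Squash unconstrained parameters (kappa1, kappa2) into bounded ranges,
    as in the quoted LQR setup:
      lambda = 1.999 + 1.998 / (1 + e^kappa1)
      sigma  = 0.001 + 1 / (1 + e^kappa2)
    As quoted, this confines sigma to (0.001, 1.001) and lambda to an
    interval of width 1.998 above 1.999.
    """
    lam = 1.999 + 1.998 / (1.0 + math.exp(kappa1))
    sigma = 0.001 + 1.0 / (1.0 + math.exp(kappa2))
    return lam, sigma

def learning_rate(beta0, j):
    """Decaying step-size schedule beta_j = beta0 * 20 / (20 + j)."""
    return beta0 * 20.0 / (20.0 + j)
```

For example, at kappa1 = kappa2 = 0 the sigmoids are at their midpoints, giving lambda = 2.998 and sigma = 0.501, and the step size halves once j reaches 20.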