Bayesian Policy Gradient and Actor-Critic Algorithms
Authors: Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems. [...] In this section, we compare the Bayesian quadrature (BQ) and the plain MC gradient estimates on a simple bandit problem as well as on a continuous state and action linear quadratic regulator (LQR). We also evaluate the performance of the Bayesian policy gradient (BPG) algorithm described in Algorithm 2 on the LQR, and compare it with a Monte-Carlo based policy gradient (MCPG) algorithm. [...] In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton and Barto, 1998) and ship steering problem (Miller et al., 1990). |
| Researcher Affiliation | Collaboration | Mohammad Ghavamzadeh, Adobe Research & INRIA; Yaakov Engel, Rafael Advanced Defence System, Israel; Michal Valko, INRIA Lille, SequeL team, France |
| Pseudocode | Yes | Algorithm 1 A Bayesian Policy Gradient Evaluation Algorithm Algorithm 2 A Bayesian Policy Gradient Algorithm Algorithm 3 Fisher Information Matrix Estimation Algorithm Algorithm 4 A Bayesian Actor-Critic Algorithm |
| Open Source Code | Yes | 17. The code for all the experiments of this section is available at https://sequel.lille.inria.fr/Software/BAC. |
| Open Datasets | Yes | We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems. [...] In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton and Barto, 1998) and ship steering problem (Miller et al., 1990). |
| Dataset Splits | No | The paper describes generating 'M' sample paths or episodes for gradient estimation and averaging results over multiple runs (e.g., 10^4 runs, 10^3 runs, 100 independent learning trials). While this indicates how data is sampled and evaluated within a reinforcement learning context, it does not specify fixed training, validation, or test dataset splits in the traditional sense, as the environments are simulated and episodes are generated dynamically. |
| Hardware Specification | No | Part of the computational experiments was conducted using the Grid 5000 experimental testbed (https://www.grid5000.fr). While this indicates a computing resource was used, it does not provide specific hardware details such as GPU models, CPU types, or memory amounts. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or library versions used for implementation (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | We use Algorithm 2 with the number of updates set to N = 100, and Model 1 with sparsification for the BPG and BPNG methods. [...] The policy parameters are initialized randomly at each run. In order to ensure that the learned parameters do not exceed an acceptable range, the policy parameters are defined as λ = 1.999 + 1.998/(1 + e^(κ1)) and σ = 0.001 + 1/(1 + e^(κ2)). [...] We used two different learning rates for the two components of the gradient. For a fixed sample size, BPG and MCPG methods start with an initial learning rate and decrease it according to the schedule β_j = β0 · 20/(20 + j). The BPNG algorithm uses a fixed learning rate multiplied by the determinant of the Fisher information matrix. We tried many values for the initial learning rates used by these algorithms and those in Table 3 yielded the best performance of those we tried. |
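The parameter mapping and learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the two formulas only, not the authors' implementation; the initial rate `beta0` passed in is a placeholder for the tuned values listed in the paper's Table 3.

```python
import math

def policy_params(kappa1, kappa2):
    """Map unconstrained parameters (kappa1, kappa2) to the bounded
    policy parameters, per the formulas quoted from the paper:
    lambda = 1.999 + 1.998/(1 + e^kappa1), sigma = 0.001 + 1/(1 + e^kappa2).
    The sigmoid keeps lambda in (1.999, 3.997) and sigma in (0.001, 1.001)."""
    lam = 1.999 + 1.998 / (1.0 + math.exp(kappa1))
    sigma = 0.001 + 1.0 / (1.0 + math.exp(kappa2))
    return lam, sigma

def learning_rate(beta0, j):
    """Decay schedule beta_j = beta0 * 20/(20 + j) used by BPG and MCPG
    at update j; beta0 is the (tuned) initial learning rate."""
    return beta0 * 20.0 / (20.0 + j)
```

For example, at update j = 20 the schedule has halved the initial rate, and as j grows the rate decays toward zero, which matches the qualitative description of a decreasing step size over the N = 100 updates.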