Sample Complexity of Variance-Reduced Distributionally Robust Q-Learning

Authors: Shengbo Wang, Nian Si, Jose Blanchet, Zhengyuan Zhou

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts. Keywords: sample complexity, reinforcement learning, distributional robustness, Q-learning, stochastic approximation. ... 4. Numerical Experiments. This section presents a numerical validation of our theoretical findings regarding the convergence properties of the proposed algorithms.
Researcher Affiliation | Academia | Shengbo Wang (EMAIL), Department of Management Science and Engineering, Stanford University; Nian Si (EMAIL), Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology; Jose Blanchet (EMAIL), Department of Management Science and Engineering, Stanford University; Zhengyuan Zhou (EMAIL), Stern School of Business, New York University
Pseudocode | Yes |
Algorithm 1 Distributionally Robust Q-Learning
Input: the total number of iterations k0 and a batch size n0.
Initialization: q_1 = 0; k = 1.
for 1 ≤ k ≤ k0 do
    Sample T_{k+1}, the n0-sample empirical DR Bellman operator as in Definition 5.
    Compute the Q-learning update
        q_{k+1} = (1 - λ_k) q_k + λ_k T_{k+1}(q_k)    (3.4)
    with stepsize λ_k = 1 / (1 + (1 - γ) k).
end for
return q_{k0+1}.
...
Algorithm 2 Variance-Reduced Distributionally Robust Q-Learning
Input: the number of epochs l_vr, a sequence of recentering sample sizes {m_l}_{l=1}^{l_vr}, an epoch length k_vr, and a batch size n_vr.
Initialization: ˆq_0 = 0; l = 1; k = 1.
for 1 ≤ l ≤ l_vr do
    Compute ˜T_l, the m_l-sample empirical DR Bellman operator as in Definition 5.
    Set q_{l,1} = ˆq_{l-1}.
    for 1 ≤ k ≤ k_vr do
        Sample T_{l,k+1}, an n_vr-sample empirical Bellman operator.
        Compute the recentered Q-learning update
            q_{l,k+1} = (1 - λ_k) q_{l,k} + λ_k [ T_{l,k+1}(q_{l,k}) - T_{l,k+1}(ˆq_{l-1}) + ˜T_l(ˆq_{l-1}) ]    (3.5)
        with stepsize λ_k = 1 / (1 + (1 - γ) k).
    end for
    Set ˆq_l = q_{l, k_vr+1}.
end for
return ˆq_{l_vr}.
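As a rough illustration of the recentered update (3.5), the following Python sketch implements the epoch / inner-loop structure of Algorithm 2. The `sample_bellman` factory is an assumption of this sketch, not an API from the paper: in the paper it would draw n transition samples and return the empirical DR Bellman operator of Definition 5; note that the same sampled operator must be evaluated at both q_{l,k} and the anchor ˆq_{l-1}.

```python
import numpy as np

def variance_reduced_q_learning(sample_bellman, dim, m, k_vr, n_vr, gamma):
    """Sketch of Algorithm 2 (variance-reduced DR Q-learning).

    sample_bellman(n) -> callable: an n-sample empirical (DR) Bellman
    operator; a stand-in for Definition 5 in the paper.
    """
    q_hat = np.zeros(dim)                    # ˆq_0 = 0
    for l in range(1, len(m) + 1):           # epochs l = 1, ..., l_vr
        T_tilde = sample_bellman(m[l - 1])   # m_l-sample recentering operator ˜T_l
        anchor = T_tilde(q_hat)              # ˜T_l(ˆq_{l-1}), fixed for the epoch
        q = q_hat.copy()                     # q_{l,1} = ˆq_{l-1}
        for k in range(1, k_vr + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)   # stepsize λ_k
            T_hat = sample_bellman(n_vr)     # fresh n_vr-sample operator, applied
                                             # at both the iterate and the anchor
            q = (1 - lam) * q + lam * (T_hat(q) - T_hat(q_hat) + anchor)
        q_hat = q                            # ˆq_l = q_{l, k_vr + 1}
    return q_hat
```

With an exact (non-sampled) Bellman stub the recentering terms cancel and the iteration reduces to plain Q-learning averaging toward the fixed point, which gives a quick sanity check of the update's sign pattern.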
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. The license information provided is for the paper itself, not the code.
Open Datasets | No | This MDP has 4 states and 2 actions, with transition probabilities for actions 1 and 2 labeled on the arrows between states. Constructed in Li et al. (2021), it is shown there that when p = (4γ - 1)/(3γ), standard non-robust Q-learning has a sample complexity of Θ((1 - γ)^{-4} ϵ^{-2}). ... First, we introduce a family of MDP instances. Define reference MDPs with S = {1, 2}, A = {a1, a2}, and transition kernel P_{0,a1} = P_{0,a2} = [[1/2, 1/2], [1/2, 1/2]]. The paper primarily uses synthetic or custom-constructed MDP instances for its numerical experiments, as described in Sections 4.1 and 4.2. It does not mention using or providing access to any external, pre-existing publicly available datasets in the conventional sense (e.g., ImageNet, MNIST).
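To make the quoted construction concrete, here is a minimal sketch of the two-state reference MDP and the hard-instance threshold p from Li et al. (2021); the variable names and the choice γ = 0.95 are illustrative, not taken from the paper.

```python
import numpy as np

# Reference MDP from Section 4: S = {1, 2}, A = {a1, a2}; both actions
# share the same uniform transition kernel P_{0,a1} = P_{0,a2}.
n_actions, n_states = 2, 2
P0 = np.full((n_actions, n_states, n_states), 0.5)
assert np.allclose(P0.sum(axis=2), 1.0)   # each row is a probability distribution

# Hard-instance parameter from Li et al. (2021): p = (4γ - 1) / (3γ), at which
# non-robust Q-learning has sample complexity Θ((1 - γ)^{-4} ϵ^{-2}).
gamma = 0.95                               # illustrative discount factor
p = (4 * gamma - 1) / (3 * gamma)
```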
Dataset Splits | No | The paper uses custom-constructed Markov Decision Process (MDP) instances for simulation-based experiments, not traditional datasets that would be subject to training/validation/test splits. The concept of dataset splits as typically understood for supervised learning is therefore not applicable or discussed. The paper describes the MDP environments, but not data splits from those environments.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the numerical experiments, such as CPU or GPU models, memory, or cloud computing specifications. The section "4. Numerical Experiments" describes the experimental setup in terms of algorithms and parameters but omits hardware information.
Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers. While it discusses algorithms like Q-learning and their variants, it does not provide details on the programming languages, libraries, or frameworks and their versions used for implementation.
Experiment Setup | Yes | Figures 2a and 2b depict the convergence properties of the two algorithms for γ ∈ {0.93, 0.95} and δ = 0.1. These figures show the (4000-sample) averaged error of the output q-function in the infinity norm plotted against the (4000-sample) averaged number of samples used, both on a log-log scale. The parameters for DR Q-learning in Figure 2a are set according to 8. On the other hand, Figure 2b plots the averaged error achieved by the variance-reduced algorithm after each epoch against the total number of samples used. ... In the subsequent developments, we use m_l = 2^l (1 - γ)^{-2} for the variance-reduced Algorithm 2. As explained in Remark 12, this choice (up to a log factor) yields the same complexity guarantee as stated in Theorem 13.
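The recentering schedule m_l = 2^l (1 - γ)^{-2} quoted above is easy to tabulate; a small sketch, with γ = 0.95 chosen purely for illustration:

```python
gamma = 0.95
# m_l = 2^l * (1 - γ)^{-2}: doubles each epoch, scaled by (1 - γ)^{-2} = 400.
m = [round(2 ** l * (1 - gamma) ** -2) for l in range(1, 6)]
print(m)  # [800, 1600, 3200, 6400, 12800]
```

The geometric growth in l is what lets the recentering estimate ˜T_l become progressively more accurate across epochs while keeping the total sample count dominated by the final epoch.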