Sample Complexity of Variance-Reduced Distributionally Robust Q-Learning

Authors: Shengbo Wang, Nian Si, Jose Blanchet, Zhengyuan Zhou

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts. Keywords: sample complexity, reinforcement learning, distributional robustness, Q-learning, stochastic approximation. ... 4. Numerical Experiments. This section presents a numerical validation of our theoretical findings regarding the convergence properties of the proposed algorithms.
Researcher Affiliation | Academia | Shengbo Wang (EMAIL), Department of Management Science and Engineering, Stanford University; Nian Si (EMAIL), Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology; Jose Blanchet (EMAIL), Department of Management Science and Engineering, Stanford University; Zhengyuan Zhou (EMAIL), Stern School of Business, New York University
Pseudocode | Yes |
Algorithm 1 Distributionally Robust Q-Learning
Input: the total number of iterations k0 and a batch size n0.
Initialization: q_1 = 0; k = 1.
for 1 ≤ k ≤ k0 do
    Sample T_{k+1}, the n0-sample empirical DR Bellman operator as in Definition 5.
    Compute the Q-learning update
        q_{k+1} = (1 - λ_k) q_k + λ_k T_{k+1}(q_k)    (3.4)
    with stepsize λ_k = 1 / (1 + (1 - γ) k).
end for
return q_{k0+1}.
...
Algorithm 2 Variance-Reduced Distributionally Robust Q-Learning
Input: the number of epochs l_vr, a sequence of recentering sample sizes {m_l}_{l=1}^{l_vr}, an epoch length k_vr, and a batch size n_vr.
Initialization: ˆq_0 = 0; l = 1; k = 1.
for 1 ≤ l ≤ l_vr do
    Compute ˜T_l, the m_l-sample empirical DR Bellman operator as in Definition 5.
    Set q_{l,1} = ˆq_{l-1}.
    for 1 ≤ k ≤ k_vr do
        Sample T_{l,k+1}, an n_vr-sample empirical Bellman operator.
        Compute the recentered Q-learning update
            q_{l,k+1} = (1 - λ_k) q_{l,k} + λ_k [ T_{l,k+1}(q_{l,k}) - T_{l,k+1}(ˆq_{l-1}) + ˜T_l(ˆq_{l-1}) ]    (3.5)
        with stepsize λ_k = 1 / (1 + (1 - γ) k).
    end for
    Set ˆq_l = q_{l, k_vr+1}.
end for
return ˆq_{l_vr}.
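As a rough illustration of the recentered update (3.5), the following Python sketch implements the epoch / inner-loop structure of Algorithm 2. The `sample_bellman` factory is an assumption of this sketch, not an API from the paper: in the paper it would draw n transition samples and return the empirical DR Bellman operator of Definition 5; note that the same sampled operator must be evaluated at both q_{l,k} and the anchor ˆq_{l-1}.

```python
import numpy as np

def variance_reduced_q_learning(sample_bellman, dim, m, k_vr, n_vr, gamma):
    """Sketch of Algorithm 2 (variance-reduced DR Q-learning).

    sample_bellman(n) -> callable: an n-sample empirical (DR) Bellman
    operator; a stand-in for Definition 5 in the paper.
    """
    q_hat = np.zeros(dim)                    # ˆq_0 = 0
    for l in range(1, len(m) + 1):           # epochs l = 1, ..., l_vr
        T_tilde = sample_bellman(m[l - 1])   # m_l-sample recentering operator ˜T_l
        anchor = T_tilde(q_hat)              # ˜T_l(ˆq_{l-1}), fixed for the epoch
        q = q_hat.copy()                     # q_{l,1} = ˆq_{l-1}
        for k in range(1, k_vr + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)   # stepsize λ_k
            T_hat = sample_bellman(n_vr)     # fresh n_vr-sample operator, applied
                                             # at both the iterate and the anchor
            q = (1 - lam) * q + lam * (T_hat(q) - T_hat(q_hat) + anchor)
        q_hat = q                            # ˆq_l = q_{l, k_vr + 1}
    return q_hat
```

With an exact (non-sampled) Bellman stub the recentering terms cancel and the iteration reduces to plain Q-learning averaging toward the fixed point, which gives a quick sanity check of the update's sign pattern.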
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. The license information provided is for the paper itself, not the code.
Open Datasets | No | This MDP has 4 states and 2 actions, with transition probabilities for actions 1 and 2 labeled on the arrows between states. Constructed in Li et al. (2021), it is shown there that when p = (4γ - 1)/(3γ), standard non-robust Q-learning has a sample complexity of Θ((1 - γ)^{-4} ϵ^{-2}). ... First, we introduce a family of MDP instances. Define reference MDPs with S = {1, 2}, A = {a1, a2}, and transition kernel P_{0,a1} = P_{0,a2} = [[1/2, 1/2], [1/2, 1/2]]. The paper primarily uses synthetic or custom-constructed MDP instances for its numerical experiments, as described in Sections 4.1 and 4.2. It does not mention using or providing access to any external, pre-existing publicly available datasets in the conventional sense (e.g., ImageNet, MNIST).
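To make the quoted construction concrete, here is a minimal sketch of the two-state reference MDP and the hard-instance threshold p from Li et al. (2021); the variable names and the choice γ = 0.95 are illustrative, not taken from the paper.

```python
import numpy as np

# Reference MDP from Section 4: S = {1, 2}, A = {a1, a2}; both actions
# share the same uniform transition kernel P_{0,a1} = P_{0,a2}.
n_actions, n_states = 2, 2
P0 = np.full((n_actions, n_states, n_states), 0.5)
assert np.allclose(P0.sum(axis=2), 1.0)   # each row is a probability distribution

# Hard-instance parameter from Li et al. (2021): p = (4γ - 1) / (3γ), at which
# non-robust Q-learning has sample complexity Θ((1 - γ)^{-4} ϵ^{-2}).
gamma = 0.95                               # illustrative discount factor
p = (4 * gamma - 1) / (3 * gamma)
```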
Dataset Splits | No | The paper uses custom-constructed Markov Decision Process (MDP) instances for simulation-based experiments, not traditional datasets that would be subject to training/validation/test splits. The concept of dataset splits as typically understood for supervised learning is therefore not applicable or discussed. The paper describes the MDP environments, but not data splits from those environments.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the numerical experiments, such as CPU or GPU models, memory, or cloud computing specifications. The section "4. Numerical Experiments" describes the experimental setup in terms of algorithms and parameters but omits hardware information.
Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers. While it discusses algorithms like Q-learning and their variants, it does not provide details on the programming languages, libraries, or frameworks and their versions used for implementation.
Experiment Setup | Yes | Figures 2a and 2b depict the convergence properties of the two algorithms for γ ∈ {0.93, 0.95} and δ = 0.1. These figures show the (4000-sample) averaged error of the output q-function in the infinity norm plotted against the (4000-sample) averaged number of samples used, both on a log-log scale. The parameters for DR Q-learning in Figure 2a are set according to 8. On the other hand, Figure 2b plots the averaged error achieved by the variance-reduced algorithm after each epoch against the total number of samples used. ... In the subsequent developments, we use m_l = 2^l (1 - γ)^{-2} for the variance-reduced Algorithm 2. As explained in Remark 12, this choice (up to a log factor) yields the same complexity guarantee as stated in Theorem 13.
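The recentering schedule m_l = 2^l (1 - γ)^{-2} quoted above is easy to tabulate; a small sketch, with γ = 0.95 chosen purely for illustration:

```python
gamma = 0.95
# m_l = 2^l * (1 - γ)^{-2}: doubles each epoch, scaled by (1 - γ)^{-2} = 400.
m = [round(2 ** l * (1 - gamma) ** -2) for l in range(1, 6)]
print(m)  # [800, 1600, 3200, 6400, 12800]
```

The geometric growth in l is what lets the recentering estimate ˜T_l become progressively more accurate across epochs while keeping the total sample count dominated by the final epoch.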