Finite-Time Analysis of Temporal Difference Learning with Experience Replay
Authors: Han-Dong Lim, Donghwan Lee
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for tabular on-policy TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that for both the averaged iterate and final iterate cases, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and the mini-batch sampled from the experience replay buffer. Our primary theoretical contribution is the derivation of the convergence rate of tabular on-policy TD-learning with experience replay memory under Markovian observation models. |
| Researcher Affiliation | Academia | Han-Dong Lim EMAIL Department of Electrical Engineering Korea Advanced Institute of Science and Technology; Donghwan Lee EMAIL Department of Electrical Engineering Korea Advanced Institute of Science and Technology |
| Pseudocode | Yes | Algorithm 1: TD-learning with replay buffer. 1: Initialize V_0 ∈ ℝ^{\|S\|} such that ‖V_0‖_∞ ≤ R_max/(1−γ). 2: Collect N samples: B^π_{−1} := {O_{−N}, O_{−N+1}, …, O_{−1}}. 3: for k ≤ T do 4: Select action a_k ∼ π(·\|s_k). 5: Observe s_{k+1} ∼ P(·\|s_k, a_k) and receive reward r_{k+1} := r(s_k, a_k, s_{k+1}). 6: Update replay buffer: B^π_k ← (B^π_{k−1} \ {(s_{k−N}, r_{k−N+1}, s_{k−N+1})}) ∪ {(s_k, r_{k+1}, s_{k+1})}. 7: Uniformly sample M^π_k ⊆ B^π_k. 8: Update V_{k+1} = V_k + α_k (1/\|M^π_k\|) Σ_{(s,r,s′)∈M^π_k} e_s (r + γ e_{s′}^⊤ V_k − e_s^⊤ V_k). |
| Open Source Code | No | The paper does not contain any explicit statement about making source code publicly available, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper is a theoretical analysis of TD-learning and does not utilize any specific, named public datasets for experimental evaluation. It focuses on the theoretical properties under a Markovian observation model. |
| Dataset Splits | No | The paper conducts theoretical analysis and does not report on experiments using specific datasets. Therefore, there is no mention of dataset splits for training, validation, or testing. |
| Hardware Specification | No | This paper is a theoretical work focusing on the finite-time analysis of TD-learning. It does not describe any experimental setup or hardware used for computations. |
| Software Dependencies | No | The paper focuses on theoretical analysis and does not describe any implementation details or software dependencies with specific version numbers. |
| Experiment Setup | No | The paper is theoretical in nature, providing a finite-time analysis of TD-learning. It does not include an experimental setup section with hyperparameter values or training configurations. |
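The pseudocode extracted above can be sketched as a runnable tabular TD(0) loop with a fixed-size replay buffer. This is a minimal illustrative implementation, not the authors' code (the paper releases none): the environment representation (`P`, `R`, `policy` arrays), the function name, and sampling the mini-batch uniformly with replacement are all assumptions made for the sketch.

```python
import numpy as np
from collections import deque

def td_with_replay(P, R, policy, gamma=0.9, alpha=0.1,
                   buffer_size=50, batch_size=8, T=2000, seed=0):
    """Hypothetical sketch of Algorithm 1 (tabular on-policy TD-learning
    with experience replay). P[s, a, s'] are transition probabilities,
    R[s, a, s'] are rewards, policy[s, a] are action probabilities.
    Returns the final value estimate V_T."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)              # V_0 = 0 satisfies the norm bound trivially
    buffer = deque(maxlen=buffer_size)  # B^pi_k; oldest tuple evicted automatically

    s = rng.integers(n_states)
    # Line 2: pre-fill the buffer with N = buffer_size transition tuples.
    for _ in range(buffer_size):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        buffer.append((s, R[s, a, s_next], s_next))
        s = s_next

    for _ in range(T):
        a = rng.choice(n_actions, p=policy[s])            # line 4: a_k ~ pi(.|s_k)
        s_next = rng.choice(n_states, p=P[s, a])          # line 5: sample s_{k+1}, reward
        buffer.append((s, R[s, a, s_next], s_next))       # line 6: slide the buffer window
        idx = rng.integers(len(buffer), size=batch_size)  # line 7: uniform mini-batch
        for i in idx:                                     # line 8: averaged TD(0) update
            bs, br, bs_next = buffer[i]
            V[bs] += (alpha / batch_size) * (br + gamma * V[bs_next] - V[bs])
        s = s_next
    return V
```

On a small two-state chain with unit rewards and γ = 0.5, the iterates settle near the fixed point V(s) = 2 for both states; with a constant step-size the residual error is governed by the buffer and mini-batch sizes, consistent with the paper's bounds.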