Finite-Time Analysis of Temporal Difference Learning with Experience Replay

Authors: Han-Dong Lim, Donghwan Lee

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for tabular on-policy TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that for both the averaged-iterate and final-iterate cases, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and of the mini-batch sampled from the replay buffer. Our primary theoretical contribution is the derivation of the convergence rate of tabular on-policy TD-learning with experience replay memory under Markovian observation models.
Researcher Affiliation | Academia | Han-Dong Lim (EMAIL), Department of Electrical Engineering, Korea Advanced Institute of Science and Technology; Donghwan Lee (EMAIL), Department of Electrical Engineering, Korea Advanced Institute of Science and Technology
Pseudocode | Yes | Algorithm 1: TD-learning with replay buffer
  1: Initialize V_0 ∈ R^{|S|} such that ||V_0||_∞ ≤ R_max / (1 − γ)
  2: Collect N samples: B^π_{−1} := {O_{−N}, O_{−N+1}, ..., O_{−1}}
  3: for k ≤ T do
  4:   Select action a_k ~ π(·|s_k)
  5:   Observe s_{k+1} ~ P(·|s_k, a_k) and receive reward r_{k+1} := r(s_k, a_k, s_{k+1})
  6:   Update replay buffer: B^π_k ← (B^π_{k−1} \ {(s_{k−N}, r_{k−N+1}, s_{k−N+1})}) ∪ {(s_k, r_{k+1}, s_{k+1})}
  7:   Uniformly sample a mini-batch M^π_k ⊆ B^π_k
  8:   Update V_{k+1} = V_k + α_k (1/|M^π_k|) Σ_{(s,r,s′)∈M^π_k} e_s (r + γ e_{s′}^⊤ V_k − e_s^⊤ V_k)
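The algorithm row above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `td_replay`, the small Markov-chain environment (transitions under a fixed policy collapse to a transition matrix `P`), and all hyperparameter values are assumptions made for the example.

```python
import numpy as np
from collections import deque

def td_replay(P, r, gamma, alpha, N, M, T, seed=0):
    """Tabular TD(0) with an experience replay buffer (sketch of Algorithm 1).

    P:     (S, S) transition matrix of the chain induced by the fixed policy.
    r:     (S, S) deterministic reward r(s, s') for each transition.
    N:     replay buffer size; M: mini-batch size; T: number of updates.
    """
    rng = np.random.default_rng(seed)
    S = P.shape[0]
    V = np.zeros(S)
    buf = deque(maxlen=N)  # B_k: deque evicts the oldest transition (line 6)
    s = rng.integers(S)
    # Pre-fill the buffer with N transitions (line 2 of Algorithm 1).
    for _ in range(N):
        s_next = rng.choice(S, p=P[s])
        buf.append((s, r[s, s_next], s_next))
        s = s_next
    for _ in range(T):
        # Observe one new transition and update the buffer (lines 4-6).
        s_next = rng.choice(S, p=P[s])
        buf.append((s, r[s, s_next], s_next))
        s = s_next
        # Uniformly sample a mini-batch from the buffer (line 7).
        idx = rng.integers(len(buf), size=M)
        # Averaged TD(0) update over the mini-batch (line 8).
        delta = np.zeros(S)
        for i in idx:
            si, ri, sj = buf[i]
            delta[si] += ri + gamma * V[sj] - V[si]
        V += alpha * delta / M
    return V
```

With a constant step-size `alpha`, the iterates fluctuate around the true value function; consistent with the paper's bounds, the residual error shrinks as the buffer size `N` and mini-batch size `M` grow.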
Open Source Code | No | The paper does not contain any explicit statement about making source code publicly available, nor does it provide a link to a code repository.
Open Datasets | No | The paper is a theoretical analysis of TD-learning and does not use any specific, named public datasets for experimental evaluation. It focuses on theoretical properties under a Markovian observation model.
Dataset Splits | No | The paper conducts theoretical analysis and does not report experiments on specific datasets; there is therefore no mention of dataset splits for training, validation, or testing.
Hardware Specification | No | This paper is a theoretical work on the finite-time analysis of TD-learning. It does not describe any experimental setup or hardware used for computations.
Software Dependencies | No | The paper focuses on theoretical analysis and does not describe implementation details or software dependencies with specific version numbers.
Experiment Setup | No | The paper is theoretical in nature, providing a finite-time analysis of TD-learning. It does not include an experimental setup section with hyperparameter values or training configurations.