Finite-Time Analysis of Temporal Difference Learning with Experience Replay
Authors: Han-Dong Lim, Donghwan Lee
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for tabular on-policy TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that for both the averaged iterate and final iterate cases, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and the mini-batch sampled from the experience replay buffer. Our primary theoretical contribution is the derivation of the convergence rate of tabular on-policy TD-learning with experience replay memory under Markovian observation models. |
| Researcher Affiliation | Academia | Han-Dong Lim EMAIL Department of Electrical Engineering Korea Advanced Institute of Science and Technology; Donghwan Lee EMAIL Department of Electrical Engineering Korea Advanced Institute of Science and Technology |
| Pseudocode | Yes | Algorithm 1: TD-learning with replay buffer. 1: Initialize V_0 ∈ ℝ^{\|S\|} such that ‖V_0‖_∞ ≤ R_max/(1−γ). 2: Collect N samples: B^π_{−1} := {O_{−N}, O_{−N+1}, …, O_{−1}}. 3: for k ≤ T do 4: Select action a_k ∼ π(·\|s_k). 5: Observe s_{k+1} ∼ P(·\|s_k, a_k) and receive reward r_{k+1} := r(s_k, a_k, s_{k+1}). 6: Update replay buffer: B^π_k ← (B^π_{k−1} \ {(s_{k−N}, r_{k−N+1}, s_{k−N+1})}) ∪ {(s_k, r_{k+1}, s_{k+1})}. 7: Uniformly sample M^π_k ⊆ B^π_k. 8: Update V_{k+1} = V_k + α_k (1/\|M^π_k\|) Σ_{(s,r,s′)∈M^π_k} e_s (r + γ e_{s′}^⊤ V_k − e_s^⊤ V_k). |
| Open Source Code | No | The paper does not contain any explicit statement about making source code publicly available, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper is a theoretical analysis of TD-learning and does not utilize any specific, named public datasets for experimental evaluation. It focuses on the theoretical properties under a Markovian observation model. |
| Dataset Splits | No | The paper conducts theoretical analysis and does not report on experiments using specific datasets. Therefore, there is no mention of dataset splits for training, validation, or testing. |
| Hardware Specification | No | This paper is a theoretical work focusing on the finite-time analysis of TD-learning. It does not describe any experimental setup or hardware used for computations. |
| Software Dependencies | No | The paper focuses on theoretical analysis and does not describe any implementation details or software dependencies with specific version numbers. |
| Experiment Setup | No | The paper is theoretical in nature, providing a finite-time analysis of TD-learning. It does not include an experimental setup section with hyperparameter values or training configurations. |
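The pseudocode extracted above can be sketched as a runnable tabular TD(0) loop with a fixed-size replay buffer. This is a minimal illustrative implementation, not the authors' code (the paper releases none): the environment representation (`P`, `R`, `policy` arrays), the function name, and sampling the mini-batch uniformly with replacement are all assumptions made for the sketch.

```python
import numpy as np
from collections import deque

def td_with_replay(P, R, policy, gamma=0.9, alpha=0.1,
                   buffer_size=50, batch_size=8, T=2000, seed=0):
    """Hypothetical sketch of Algorithm 1 (tabular on-policy TD-learning
    with experience replay). P[s, a, s'] are transition probabilities,
    R[s, a, s'] are rewards, policy[s, a] are action probabilities.
    Returns the final value estimate V_T."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)              # V_0 = 0 satisfies the norm bound trivially
    buffer = deque(maxlen=buffer_size)  # B^pi_k; oldest tuple evicted automatically

    s = rng.integers(n_states)
    # Line 2: pre-fill the buffer with N = buffer_size transition tuples.
    for _ in range(buffer_size):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        buffer.append((s, R[s, a, s_next], s_next))
        s = s_next

    for _ in range(T):
        a = rng.choice(n_actions, p=policy[s])            # line 4: a_k ~ pi(.|s_k)
        s_next = rng.choice(n_states, p=P[s, a])          # line 5: sample s_{k+1}, reward
        buffer.append((s, R[s, a, s_next], s_next))       # line 6: slide the buffer window
        idx = rng.integers(len(buffer), size=batch_size)  # line 7: uniform mini-batch
        for i in idx:                                     # line 8: averaged TD(0) update
            bs, br, bs_next = buffer[i]
            V[bs] += (alpha / batch_size) * (br + gamma * V[bs_next] - V[bs])
        s = s_next
    return V
```

On a small two-state chain with unit rewards and γ = 0.5, the iterates settle near the fixed point V(s) = 2 for both states; with a constant step-size the residual error is governed by the buffer and mini-batch sizes, consistent with the paper's bounds.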