Logarithmic Regret for Online KL-Regularized Reinforcement Learning

Authors: Heyang Zhao, Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret... We establish the theoretical guarantees for the algorithms and demonstrate their statistical advantages over standard RL scenarios. We summarize our contributions as follows. For KL-regularized contextual bandits, we establish the first O(η · log(N_R T) · d_R) regret bound that scales logarithmically with the number of time steps T in the standard online RL setting... We propose two provably efficient algorithms, KL-regularized UCB and KL-regularized LSVI with UCB, based on the standard optimism principle, and show that they achieve regret bounds that scale logarithmically with T...
Researcher Affiliation | Academia | (1) University of California, Los Angeles, CA 90095, USA; (2) University of Illinois Urbana-Champaign, IL 61801, USA.
Pseudocode | Yes | Algorithm 1: KL-Regularized UCB; Algorithm 2: KL-Regularized LSVI with UCB
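The two algorithms named above both instantiate the optimism principle under a KL-regularized objective, whose closed-form solution is a softmax tilt of the reference policy by the (optimistic) reward. Below is a minimal sketch of one KL-regularized UCB round for a linear contextual bandit; the linear-reward setup and the constants `eta` and `beta` are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

# Illustrative sketch: one run of KL-regularized UCB on a synthetic
# linear contextual bandit. All problem sizes and constants are assumed.
rng = np.random.default_rng(0)
d, K, T = 3, 4, 50            # feature dim, number of arms, rounds (assumed)
eta = 1.0                     # KL-regularization strength (assumed)
beta = 0.5                    # optimism bonus scale (assumed constant)
theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown true reward parameter

Lam = np.eye(d)               # ridge-regression Gram matrix
b = np.zeros(d)               # accumulated reward-weighted features
pi_ref = np.full(K, 1.0 / K)  # uniform reference policy

for t in range(T):
    phi = rng.normal(size=(K, d))          # per-arm context features
    theta_hat = np.linalg.solve(Lam, b)    # ridge estimate of theta_star
    # Elliptical confidence bonus: beta * sqrt(phi_a^T Lam^{-1} phi_a)
    Lam_inv = np.linalg.inv(Lam)
    bonus = beta * np.sqrt(np.einsum('kd,dc,kc->k', phi, Lam_inv, phi))
    ucb = phi @ theta_hat + bonus          # optimistic reward estimate
    # KL-regularized optimal policy: pi(a) ∝ pi_ref(a) * exp(eta * ucb(a))
    logits = np.log(pi_ref) + eta * ucb
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    a = rng.choice(K, p=pi)                # sample an arm from the policy
    r = phi[a] @ theta_star + 0.1 * rng.normal()
    Lam += np.outer(phi[a], phi[a])        # online ridge update
    b += r * phi[a]
```

The softmax step is where KL regularization changes the usual UCB recipe: instead of pulling the arg-max arm, the learner samples from a policy that stays close (in KL) to `pi_ref` while tilting toward optimistic rewards, with `eta` controlling the trade-off.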
Open Source Code | No | The paper does not provide any statements about releasing source code or a link to a code repository.
Open Datasets | Yes | For instance, contrastive learning without regularization, as observed in Meng et al. (2024), can degrade performance on standard reasoning benchmarks like MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021).
Dataset Splits | No | The paper is theoretical and does not describe any experimental setup or dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any experimental setup or specific hyperparameters.