Position: Constants are Critical in Regret Bounds for Reinforcement Learning

Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical validation successfully demonstrates how improving the multiplicative constants has significant positive effects on the actual empirical performances of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether algorithms match is the proper approach. (...) In this section, we numerically compare the performances of UCBVI, both with the Chernoff-Hoeffding and Bernstein-Freedman bonuses of (Azar et al., 2017) and with the improved Bernstein-Freedman bonus of this paper, against the MVP algorithm.
Researcher Affiliation | Academia | Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago <EMAIL>.
Pseudocode | Yes | Algorithm 1: UCBVI.
Open Source Code | Yes | The code to reproduce the experiments can be found at: https://github.com/marcomussi/position_constants.
Open Datasets | Yes | We now consider the River Swim environment (Strehl & Littman, 2008).
Dataset Splits | No | The paper describes using randomly generated MDPs for illustrative environments and a benchmark environment (River Swim) with a specified number of runs and episodes, but does not provide specific training/test/validation dataset splits. The evaluation is done by averaging over multiple runs for these environments, rather than partitioning a static dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions algorithms like UCBVI and MVP, and implicitly relies on programming languages for implementation (e.g., Python, given the GitHub link), but it does not specify any particular software libraries or their version numbers (e.g., PyTorch, TensorFlow, NumPy, scikit-learn) that would be needed to reproduce the experiments.
Experiment Setup | Yes | We consider an MDP with parameters S = 3, A = 3, H ∈ {5, 10}, and we consider a number of episodes K ∈ {10^5, 10^6}. (...) all the N_{k,h}(x, a) terms are considered as N_k(x, a), removing the discriminant of the stage from the algorithm, and the c_2 constant (which refers to the uncertainty in the estimation of the rewards) is set to 0, to remove the exploration factor needed due to the stochasticity of the reward in the original paper. The resulting exploration bound is: $b^{\mathrm{MVP}}_{k,h}(x,a) = 460\sqrt{\frac{\operatorname{Var}_{y\sim\hat{P}_k(\cdot\mid x,a)}\left(V_{k,h+1}(y)\right)\log\frac{1}{\delta}}{\max\{N(x,a),1\}}} + \frac{\log\frac{1}{\delta}}{\max\{N(x,a),1\}}$. (...) We evaluate each experiment by averaging over 10 runs. In each run, the rewards and transition probabilities of the MDP are randomly generated. Then, the clairvoyant optimum is calculated for the purpose of regret computation, and the algorithms are evaluated.
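Algorithm 1 (UCBVI) is reported in the paper as pseudocode. A minimal sketch of its optimistic backward induction, using the Chernoff-Hoeffding bonus c·H·L·√(1/N) with the leading constant c = 7 from Azar et al. (2017), might look as follows; the function signature, array shapes, and the cap of values at H are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def ucbvi_value_iteration(P_hat, R_hat, N, H, L, c=7.0):
    """One optimistic planning step of UCBVI (illustrative sketch).

    P_hat : (S, A, S) empirical transition probabilities
    R_hat : (S, A)    empirical mean rewards
    N     : (S, A)    visit counts
    L     : confidence term, e.g. log(1/delta)
    c     : leading multiplicative constant (7 in Azar et al., 2017) --
            exactly the kind of constant whose size the paper argues matters.
    Returns the optimistic Q-values, shape (H, S, A).
    """
    S, A, _ = P_hat.shape
    Q = np.zeros((H, S, A))
    V = np.zeros(S)  # V at stage H+1 is zero
    bonus = c * H * L / np.sqrt(np.maximum(N, 1))  # Chernoff-Hoeffding bonus
    for h in range(H - 1, -1, -1):
        Q[h] = np.minimum(R_hat + bonus + P_hat @ V, H)  # cap values at H
        V = Q[h].max(axis=1)
    return Q
```

Halving c halves every bonus term, which is precisely the constant-level change whose empirical effect the experiments measure.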
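The River Swim benchmark mentioned above can be written down in a few lines. The transition probabilities and rewards below follow a common parameterization of Strehl & Littman (2008) — six states, a deterministic "left" action with a small reward in the leftmost state, and a weak "right" current with the large reward in the rightmost state; the exact numbers vary across papers and are not taken from this one:

```python
import numpy as np

def river_swim(n_states=6, p_right=0.3, p_stay=0.6):
    """River Swim MDP (Strehl & Littman, 2008), a common parameterization.

    Returns P of shape (S, 2, S) and R of shape (S, 2), with action 0 =
    "left" (deterministic) and action 1 = "right" (succeeds w.p. p_right).
    """
    S = n_states
    P = np.zeros((S, 2, S))
    R = np.zeros((S, 2))
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0  # "left" always works
        if s < S - 1:
            P[s, 1, s + 1] = p_right                       # swim forward
            P[s, 1, s] += p_stay                           # held in place
            P[s, 1, max(s - 1, 0)] += 1.0 - p_right - p_stay  # swept back
        else:
            P[s, 1, s] = p_right + p_stay                  # rightmost state
            P[s, 1, s - 1] = 1.0 - p_right - p_stay
    R[0, 0] = 5.0 / 1000.0  # small safe reward on the left
    R[S - 1, 1] = 1.0       # large reward on the right
    return P, R
```

The long chain with a weak current is what makes River Swim a standard hard-exploration benchmark for bonus-based algorithms such as UCBVI and MVP.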
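The per-run pipeline described in the last row — sample a random MDP, compute the clairvoyant optimum, then evaluate the algorithms' regret — can be sketched as follows. The generating distributions (Dirichlet transitions, uniform rewards) are assumptions, since the response does not specify how the random MDPs are drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mdp(S=3, A=3):
    """One randomly generated tabular MDP (distributions are assumed)."""
    P = rng.dirichlet(np.ones(S), size=(S, A))  # (S, A, S) transition tensor
    R = rng.uniform(size=(S, A))                # mean rewards in [0, 1]
    return P, R

def clairvoyant_value(P, R, H):
    """Optimal H-step values by backward induction on the true model."""
    V = np.zeros(P.shape[0])
    for _ in range(H):
        V = (R + P @ V).max(axis=1)  # Bellman optimality backup
    return V

# The regret of one run is K * V*(s_1) minus the return the learner
# collects over K episodes; the reported results average this quantity
# over 10 independently sampled MDPs.
```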