Position: Constants are Critical in Regret Bounds for Reinforcement Learning
Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical validation successfully demonstrates how improving the multiplicative constants has significant positive effects on the actual empirical performances of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether algorithms match is the proper approach. (...) In this section, we numerically compare the performances of UCBVI, both with the Chernoff-Hoeffding and Bernstein Freedman bonuses of (Azar et al., 2017) and with the improved Bernstein-Freedman bonus of this paper, against the MVP algorithm. |
| Researcher Affiliation | Academia | 1Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: UCBVI. |
| Open Source Code | Yes | The code to reproduce the experiments can be found at: https://github.com/marcomussi/position_constants. |
| Open Datasets | Yes | We now consider the River Swim environment (Strehl & Littman, 2008). |
| Dataset Splits | No | The paper describes using randomly generated MDPs for illustrative environments and a benchmark environment (River Swim) with a specified number of runs and episodes, but does not provide specific training/test/validation dataset splits. The evaluation is done by averaging over multiple runs for these environments, rather than partitioning a static dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions algorithms like UCBVI and MVP, and implicitly relies on programming languages for implementation (e.g., Python, given the GitHub link), but it does not specify any particular software libraries or their version numbers (e.g., PyTorch, TensorFlow, NumPy, scikit-learn) that would be needed to reproduce the experiments. |
| Experiment Setup | Yes | We consider an MDP with parameters S = 3, A = 3, H ∈ {5, 10}, and we consider a number of episodes K ∈ {10^5, 10^6}. (...) all the N_{k,h}(x, a) terms are considered as N_k(x, a), removing the discriminant of the stage from the algorithm, and the c_2 constant (which refers to the uncertainty in the estimation of the rewards) is set to 0, to remove the exploration factor needed due to the stochasticity of the reward in the original paper. The resulting exploration bound is: b^MVP_{k,h}(x, a) = 460 √( Var_{y∼P̂_k(·\|x,a)}(V_{k,h+1}(y)) · log(1/δ) / max{N(x, a), 1} ) + log(1/δ) / max{N(x, a), 1}. (...) We evaluate each experiment by averaging over 10 runs. In each run, the rewards and transition probabilities of the MDP are randomly generated. Then, the clairvoyant optimum is calculated for the purpose of regret computation, and the algorithms are evaluated. |
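The simplified MVP-style exploration bound quoted in the setup row can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `mvp_bonus` and the argument names are hypothetical, `p_hat` stands in for the empirical transition distribution P̂_k(·|x,a), and `v_next` for the estimated values V_{k,h+1}(y); the additive lower-order term is taken as log(1/δ)/max{N(x,a), 1} as it appears in the quoted formula.

```python
import numpy as np

def mvp_bonus(p_hat, v_next, n_visits, delta):
    """Sketch of the simplified MVP-style Bernstein bonus (hypothetical helper).

    p_hat    : empirical next-state distribution, shape (S,)
    v_next   : estimated next-step values V_{k,h+1}(y), shape (S,)
    n_visits : visit count N(x, a)
    delta    : confidence parameter
    """
    n = max(n_visits, 1)
    log_term = np.log(1.0 / delta)
    # Variance of the next-state value under the empirical model
    mean_v = np.dot(p_hat, v_next)
    var_v = np.dot(p_hat, (v_next - mean_v) ** 2)
    # 460 * sqrt(variance * log(1/delta) / n) plus the lower-order term
    return 460.0 * np.sqrt(var_v * log_term / n) + log_term / n
```

Note how the 460 multiplicative constant dominates the bonus whenever the empirical variance is nonzero, which is exactly the kind of effect on practical exploration that the paper's position highlights.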