Position: Constants are Critical in Regret Bounds for Reinforcement Learning

Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical validation successfully demonstrates how improving the multiplicative constants has significant positive effects on the actual empirical performances of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether algorithms match is the proper approach. (...) In this section, we numerically compare the performances of UCBVI, both with the Chernoff-Hoeffding and Bernstein-Freedman bonuses of (Azar et al., 2017) and with the improved Bernstein-Freedman bonus of this paper, against the MVP algorithm.
Researcher Affiliation | Academia | Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago <EMAIL>.
Pseudocode | Yes | Algorithm 1: UCBVI.
Open Source Code | Yes | The code to reproduce the experiments can be found at: https://github.com/marcomussi/position_constants.
Open Datasets | Yes | We now consider the River Swim environment (Strehl & Littman, 2008).
Dataset Splits | No | The paper describes using randomly generated MDPs for illustrative environments and a benchmark environment (River Swim) with a specified number of runs and episodes, but does not provide specific training/test/validation dataset splits. The evaluation is done by averaging over multiple runs for these environments, rather than partitioning a static dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions algorithms like UCBVI and MVP, and implicitly relies on programming languages for implementation (e.g., Python, given the GitHub link), but it does not specify any particular software libraries or their version numbers (e.g., PyTorch, TensorFlow, NumPy, scikit-learn) that would be needed to reproduce the experiments.
Experiment Setup | Yes | We consider an MDP with parameters S = 3, A = 3, H ∈ {5, 10}, and we consider a number of episodes K ∈ {10^5, 10^6}. (...) all the N_{k,h}(x, a) terms are considered as N_k(x, a), removing the discriminant of the stage from the algorithm, and the c_2 constant (which refers to the uncertainty in the estimation of the rewards) is set to 0, to remove the exploration factor needed due to the stochasticity of the reward in the original paper. The resulting exploration bound is: $b^{\mathrm{MVP}}_{k,h}(x,a) = 460\sqrt{\frac{\operatorname{Var}_{y\sim\hat{P}_k(\cdot\mid x,a)}\left(V_{k,h+1}(y)\right)\log\frac{1}{\delta}}{\max\{N(x,a),1\}}} + \frac{\log\frac{1}{\delta}}{\max\{N(x,a),1\}}$. (...) We evaluate each experiment by averaging over 10 runs. In each run, the rewards and transition probabilities of the MDP are randomly generated. Then, the clairvoyant optimum is calculated for the purpose of regret computation, and the algorithms are evaluated.
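Algorithm 1 (UCBVI) is reported in the paper as pseudocode. A minimal sketch of its optimistic backward induction, using the Chernoff-Hoeffding bonus c·H·L·√(1/N) with the leading constant c = 7 from Azar et al. (2017), might look as follows; the function signature, array shapes, and the cap of values at H are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def ucbvi_value_iteration(P_hat, R_hat, N, H, L, c=7.0):
    """One optimistic planning step of UCBVI (illustrative sketch).

    P_hat : (S, A, S) empirical transition probabilities
    R_hat : (S, A)    empirical mean rewards
    N     : (S, A)    visit counts
    L     : confidence term, e.g. log(1/delta)
    c     : leading multiplicative constant (7 in Azar et al., 2017) --
            exactly the kind of constant whose size the paper argues matters.
    Returns the optimistic Q-values, shape (H, S, A).
    """
    S, A, _ = P_hat.shape
    Q = np.zeros((H, S, A))
    V = np.zeros(S)  # V at stage H+1 is zero
    bonus = c * H * L / np.sqrt(np.maximum(N, 1))  # Chernoff-Hoeffding bonus
    for h in range(H - 1, -1, -1):
        Q[h] = np.minimum(R_hat + bonus + P_hat @ V, H)  # cap values at H
        V = Q[h].max(axis=1)
    return Q
```

Halving c halves every bonus term, which is precisely the constant-level change whose empirical effect the experiments measure.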
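The River Swim benchmark mentioned above can be written down in a few lines. The transition probabilities and rewards below follow a common parameterization of Strehl & Littman (2008) — six states, a deterministic "left" action with a small reward in the leftmost state, and a weak "right" current with the large reward in the rightmost state; the exact numbers vary across papers and are not taken from this one:

```python
import numpy as np

def river_swim(n_states=6, p_right=0.3, p_stay=0.6):
    """River Swim MDP (Strehl & Littman, 2008), a common parameterization.

    Returns P of shape (S, 2, S) and R of shape (S, 2), with action 0 =
    "left" (deterministic) and action 1 = "right" (succeeds w.p. p_right).
    """
    S = n_states
    P = np.zeros((S, 2, S))
    R = np.zeros((S, 2))
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0  # "left" always works
        if s < S - 1:
            P[s, 1, s + 1] = p_right                       # swim forward
            P[s, 1, s] += p_stay                           # held in place
            P[s, 1, max(s - 1, 0)] += 1.0 - p_right - p_stay  # swept back
        else:
            P[s, 1, s] = p_right + p_stay                  # rightmost state
            P[s, 1, s - 1] = 1.0 - p_right - p_stay
    R[0, 0] = 5.0 / 1000.0  # small safe reward on the left
    R[S - 1, 1] = 1.0       # large reward on the right
    return P, R
```

The long chain with a weak current is what makes River Swim a standard hard-exploration benchmark for bonus-based algorithms such as UCBVI and MVP.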
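The per-run pipeline described in the last row — sample a random MDP, compute the clairvoyant optimum, then evaluate the algorithms' regret — can be sketched as follows. The generating distributions (Dirichlet transitions, uniform rewards) are assumptions, since the response does not specify how the random MDPs are drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mdp(S=3, A=3):
    """One randomly generated tabular MDP (distributions are assumed)."""
    P = rng.dirichlet(np.ones(S), size=(S, A))  # (S, A, S) transition tensor
    R = rng.uniform(size=(S, A))                # mean rewards in [0, 1]
    return P, R

def clairvoyant_value(P, R, H):
    """Optimal H-step values by backward induction on the true model."""
    V = np.zeros(P.shape[0])
    for _ in range(H):
        V = (R + P @ V).max(axis=1)  # Bellman optimality backup
    return V

# The regret of one run is K * V*(s_1) minus the return the learner
# collects over K episodes; the reported results average this quantity
# over 10 independently sampled MDPs.
```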