Sleeping Reinforcement Learning

Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental E. Numerical Validation. In this appendix, we propose the Stochastic Frozen Lake setting and numerically validate our S-UCBVI against UCBVI, showing the efficacy of exploiting the knowledge of action availability. The code to reproduce the experiments is available at https://github.com/marcomussi/Sleeping RL. Setting. The Stochastic Frozen Lake environment is a modification of the well-known Frozen Lake that allows holes in the lake to open and close stochastically, effectively limiting the agent's action availability during the episode. The probability of a cell of the grid being a hole at any given stage is denoted by the parameter p, except for the goal cell and the cell in which the agent is located at the beginning of the stage, which can never be holes. We vary the probability of holes in the lake as p ∈ {0, 0.5, 0.75} and the grid size of the lake as G ∈ {2, 3, 4}. We consider a horizon H = 10 to ensure that the agent can reach the goal. We consider K = 2·10^5 episodes, and we compare S-UCBVI and UCBVI in terms of instantaneous reward averaged over 5 runs, with a 95% confidence interval. We also report the optimum, computed a priori, for reference. Results. The results of the experiment are reported in Figure 7. We observe that, when p = 0, i.e., there are no holes in the lake, both S-UCBVI and UCBVI achieve the optimal instantaneous reward. As p and G increase, S-UCBVI still attains the optimum, whereas UCBVI settles at a suboptimal value, with the gap between the two algorithms widening as both parameters grow.
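The hole-sampling rule described above (each cell becomes a hole with probability p, except the goal cell and the agent's current cell) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation; the function name `sample_holes` and its signature are assumptions.

```python
import numpy as np

def sample_holes(grid_size, p, agent_cell, goal_cell, rng):
    """Sample which cells are holes at the start of a stage.

    Each cell independently becomes a hole with probability p,
    except the goal cell and the cell the agent occupies at the
    beginning of the stage, which are never holes.
    Hypothetical sketch of the setting described in the appendix.
    """
    holes = rng.random((grid_size, grid_size)) < p
    holes[agent_cell] = False  # agent's cell is always safe
    holes[goal_cell] = False   # goal cell is always safe
    return holes

rng = np.random.default_rng(0)
holes = sample_holes(4, 0.5, agent_cell=(0, 0), goal_cell=(3, 3), rng=rng)
```

Actions leading into a hole would then be unavailable for that stage, which is how the environment induces stochastic action availability.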
Researcher Affiliation Academia Politecnico di Milano, Milan, Italy.
Pseudocode Yes Algorithm 1: Interaction Protocol Per-episode. ... Algorithm 7: Sleeping UCBVI (S-UCBVI).
Open Source Code Yes The code to reproduce the experiments is available at https://github.com/marcomussi/Sleeping RL.
Open Datasets No The paper describes using a "Stochastic Frozen Lake environment" which is a modification of a well-known environment. However, no specific access information (link, DOI, or citation) is provided for this modified environment, nor is it presented as a traditional publicly available dataset.
Dataset Splits No The paper states, "We consider K = 2·10^5 episodes," but does not specify any training, testing, or validation splits for a dataset. In this Reinforcement Learning context, data is generated through interaction, not pre-split.
Hardware Specification No The paper does not contain specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments.
Software Dependencies No The paper mentions that the code is available on GitHub, implying the use of a programming language like Python, but it does not specify any particular software libraries or their version numbers that would be necessary to replicate the experiments.
Experiment Setup Yes We consider a horizon H = 10 to ensure that the agent can reach the goal. We consider K = 2·10^5 episodes, and we compare S-UCBVI and UCBVI in terms of instantaneous reward averaged over 5 runs, with a 95% confidence interval. We also report the optimum, computed a priori, for reference. ... We vary the probability of holes in the lake as p ∈ {0, 0.5, 0.75} and the grid size of the lake as G ∈ {2, 3, 4}.
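The reported setup amounts to a 3×3 grid of configurations (three hole probabilities times three grid sizes), each run 5 times. A minimal sketch of that parameter sweep, assuming a plain dict-based config (the keys are illustrative, not taken from the authors' code):

```python
# Hypothetical enumeration of the experiment configurations
# reported in the paper's appendix.
H = 10               # horizon
K = 2 * 10**5        # episodes per run
hole_probs = [0, 0.5, 0.75]   # p values
grid_sizes = [2, 3, 4]        # G values
n_runs = 5           # runs averaged, with a 95% confidence interval

configs = [
    {"p": p, "G": G, "H": H, "K": K, "runs": n_runs}
    for p in hole_probs
    for G in grid_sizes
]
```

Each configuration would be executed for both S-UCBVI and UCBVI, with the a-priori optimum recorded for reference.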