Online Learning in Risk Sensitive Constrained MDP

Authors: Arnob Ghosh, Mehrdad Moharrami

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We simulate our proposed approach on a 5 × 5 Grid-World with two actions: ⇐ and ⇒. In all simulations, we use K = 15,000, ξ = K^{1/4}, and η = c K^{1/4}/V_{g,max}, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. Figure 1 illustrates the behavior of the algorithm for α = 0.0001 as a function of k. Table 1 presents the empirical reward values V^{emp}_{r,1}(s_1) and empirical risk values V^{emp}_{g,1}(s_1) for the average policy induced by the final 20 episodes after K = 15,000 iterations, across various choices of α and B.
Researcher Affiliation | Academia | ¹Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, USA; ²Department of Computer Science, University of Iowa, Iowa City, USA. Correspondence to: Arnob Ghosh <EMAIL>.
Pseudocode | Yes | Algorithm 1: Constraint Risk Sensitive Value Iteration Algorithm
Open Source Code | Yes | The code is available at: https://github.com/mmoharami/Risk-Sensitive-CMDP.
Open Datasets | No | The information is insufficient. The paper describes a custom simulation environment (a 5 × 5 Grid-World) in its text and tables (Tables 2, 3, 4), but it does not use or provide access to an external, publicly available dataset: there is no link, DOI, specific repository, or citation to an established benchmark dataset.
Dataset Splits | No | The information is insufficient. The paper describes an episodic reinforcement-learning setting with a total of K = 15,000 episodes; it does not mention training/validation/test splits, which are typically used in supervised learning or offline evaluation.
Hardware Specification | No | The information is insufficient. The paper does not mention the hardware used to run the experiments (GPU/CPU models, processor type, or memory).
Software Dependencies | No | The information is insufficient. The paper links to its code repository, but it does not explicitly list ancillary software dependencies with version numbers (programming language, libraries, or frameworks) needed to replicate the experiments.
Experiment Setup | Yes | In all simulations, we use K = 15,000, ξ = K^{1/4}, and η = c K^{1/4}/V_{g,max}, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. For faster convergence, we use the bonus terms Bon^k_{r,h}(s, a) = 0.5 H log(K)/N^k_h(s, a) and Bon^k_{g,h}(s, a) = 0.005 V_{g,max} log(K)/N^k_h(s, a), instead of the values for which we obtain the regret and violation bounds, across all values of α. We use the discretized budget space with precision K^{1/2} and δ = 0.05. The initial policy π_0 is uniform across the two actions for every augmented state. The total horizon is H = 9.
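
The step-size schedule and bonus terms quoted above can be sketched as follows. This is a minimal illustration only: the variable names are ours, and V_g_max is a placeholder (the paper defines V_{g,max} via the setup in Appendix I); the linked repository is authoritative.

```python
import math

# Sketch of the reported hyperparameters (names and V_g_max are assumptions).
K = 15_000          # total episodes, as reported
H = 9               # total horizon, as reported
V_g_max = float(H)  # placeholder bound on the utility value function

def eta(k):
    """Step size eta_k = c_k * K^(1/4) / V_g,max, where the coefficient
    c_k is linearly scaled down from 100 (episode 0) to 1 (episode K-1)."""
    c_k = 100.0 - 99.0 * k / (K - 1)
    return c_k * K ** 0.25 / V_g_max

def bonus_r(n):
    """Reward bonus Bon^k_{r,h}(s, a) = 0.5 * H * log(K) / N^k_h(s, a),
    where n is the visit count N^k_h(s, a)."""
    return 0.5 * H * math.log(K) / n

def bonus_g(n):
    """Utility bonus Bon^k_{g,h}(s, a) = 0.005 * V_g,max * log(K) / N^k_h(s, a)."""
    return 0.005 * V_g_max * math.log(K) / n
```

Both bonuses shrink as the visit count grows, and the step size decays from 100 K^{1/4}/V_{g,max} in the first episode to K^{1/4}/V_{g,max} in the last, matching the "larger η in earlier episodes" remark.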