Online Learning in Risk Sensitive Constrained MDP

Authors: Arnob Ghosh, Mehrdad Moharrami

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We simulate our proposed approach on a 5 × 5 Grid-World with two actions: ⇐ and ⇒. In all simulations, we use K = 15,000, ξ = K^{1/4}, and η = c K^{1/4}/V_{g,max}, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. Figure 1 illustrates the behavior of the algorithm for α = 0.0001 as a function of k. Table 1 presents the empirical reward values V^{emp}_{r,1}(s_1) and empirical risk values V^{emp}_{g,1}(s_1) for the average policy induced by the final 20 episodes after K = 15,000 iterations, across various choices of α and B.
Researcher Affiliation | Academia | ¹Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, USA; ²Department of Computer Science, University of Iowa, Iowa City, USA. Correspondence to: Arnob Ghosh <EMAIL>.
Pseudocode | Yes | Algorithm 1: Constraint Risk Sensitive Value Iteration Algorithm
Open Source Code | Yes | The code is available at: https://github.com/mmoharami/Risk-Sensitive-CMDP.
Open Datasets | No | The information is insufficient. The paper describes a custom simulation environment (a 5 × 5 Grid-World) in its text and tables (Tables 2, 3, 4), but it does not use or provide access to an external, publicly available dataset: there is no link, DOI, specific repository, or citation to an established benchmark dataset.
Dataset Splits | No | The information is insufficient. The paper describes an episodic reinforcement-learning setting with a total of K = 15,000 episodes; it does not mention training/validation/test splits, which are typically used in supervised learning or offline evaluation.
Hardware Specification | No | The information is insufficient. The paper does not mention the hardware used to run the experiments (GPU/CPU models, processor type, or memory).
Software Dependencies | No | The information is insufficient. The paper links to its code repository, but it does not explicitly list ancillary software dependencies with version numbers (programming language, libraries, or frameworks) needed to replicate the experiments.
Experiment Setup | Yes | In all simulations, we use K = 15,000, ξ = K^{1/4}, and η = c K^{1/4}/V_{g,max}, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. For faster convergence, we use the bonus terms Bon^k_{r,h}(s, a) = 0.5 H log(K)/N^k_h(s, a) and Bon^k_{g,h}(s, a) = 0.005 V_{g,max} log(K)/N^k_h(s, a), instead of the values for which we obtain the regret and violation bounds, across all values of α. We use the discretized budget space with precision K^{1/2} and δ = 0.05. The initial policy π_0 is uniform across the two actions for every augmented state. The total horizon is H = 9.
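
The step-size schedule and bonus terms quoted above can be sketched as follows. This is a minimal illustration only: the variable names are ours, and V_g_max is a placeholder (the paper defines V_{g,max} via the setup in Appendix I); the linked repository is authoritative.

```python
import math

# Sketch of the reported hyperparameters (names and V_g_max are assumptions).
K = 15_000          # total episodes, as reported
H = 9               # total horizon, as reported
V_g_max = float(H)  # placeholder bound on the utility value function

def eta(k):
    """Step size eta_k = c_k * K^(1/4) / V_g,max, where the coefficient
    c_k is linearly scaled down from 100 (episode 0) to 1 (episode K-1)."""
    c_k = 100.0 - 99.0 * k / (K - 1)
    return c_k * K ** 0.25 / V_g_max

def bonus_r(n):
    """Reward bonus Bon^k_{r,h}(s, a) = 0.5 * H * log(K) / N^k_h(s, a),
    where n is the visit count N^k_h(s, a)."""
    return 0.5 * H * math.log(K) / n

def bonus_g(n):
    """Utility bonus Bon^k_{g,h}(s, a) = 0.005 * V_g,max * log(K) / N^k_h(s, a)."""
    return 0.005 * V_g_max * math.log(K) / n
```

Both bonuses shrink as the visit count grows, and the step size decays from 100 K^{1/4}/V_{g,max} in the first episode to K^{1/4}/V_{g,max} in the last, matching the "larger η in earlier episodes" remark.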