Online Learning in Risk Sensitive Constrained MDP
Authors: Arnob Ghosh, Mehrdad Moharrami
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We simulate our proposed approach on a 5 × 5 Grid-World with two actions: ⇐ and ⇒. In all simulations, we use K = 15,000, ξ = K^{1/4}, and η = cK^{1/4}/V_{g,max}, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. Figure 1 illustrates the behavior of the algorithm for α = 0.0001 as a function of k. Table 1 presents the empirical reward values V^{emp}_{r,1}(s_1) and empirical risk values V^{emp}_{g,1}(s_1) for the average policy induced by the final 20 episodes after K = 15,000 iterations, across various choices of α and B. |
| Researcher Affiliation | Academia | ¹Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, USA; ²Department of Computer Science, University of Iowa, Iowa City, USA. Correspondence to: Arnob Ghosh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Constraint Risk Sensitive Value Iteration Algorithm |
| Open Source Code | Yes | The code is available at: https://github.com/mmoharami/Risk-Sensitive-CMDP. |
| Open Datasets | No | The information is insufficient. The paper describes a custom simulation environment (a 5 × 5 Grid-World) within its text and tables (Tables 2, 3, 4), but it does not use or provide access to an external publicly available dataset according to the prompt's criteria (no link, DOI, specific repository, or citation to an established benchmark dataset). |
| Dataset Splits | No | The information is insufficient. The paper describes an episodic reinforcement learning setting with a total number of episodes (K=15,000) for simulations. It does not mention traditional training/test/validation dataset splits, which are typically used in supervised learning or offline evaluation scenarios. |
| Hardware Specification | No | The information is insufficient. No specific hardware details (such as GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The information is insufficient. The paper provides a link to its code repository, but it does not explicitly list specific ancillary software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) needed to replicate the experiment. |
| Experiment Setup | Yes | In all simulations, we use K = 15,000, ξ = K^{1/4}, and η = cK^{1/4}/V_{g,max}, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. For faster convergence, we use the bonus terms Bon^k_{r,h}(s, a) = 0.5H log(K)/N^k_h(s, a) and Bon^k_{g,h}(s, a) = 0.005 V_{g,max} log(K)/N^k_h(s, a) instead of the values for which the regret and violation bounds are obtained, across all values of α. We use the discretized budget space with precision K^{1/2}, and δ = 0.05. The initial policy π_0 is uniform across the two actions for every augmented state. The total horizon is H = 9. |
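The hyperparameter choices quoted above (the linearly decayed step-size coefficient c and the two bonus terms) can be sketched as plain functions. This is a minimal illustration of the stated formulas only, not the authors' implementation; the function names and the placeholder value of V_{g,max} (which the paper derives from the utility bounds, and which is not given in this excerpt) are assumptions.

```python
import math

# Constants quoted in the reproducibility report.
K = 15_000        # total number of episodes
H = 9             # horizon
V_G_MAX = 10.0    # placeholder: the paper defines V_{g,max} elsewhere

XI = K ** 0.25    # xi = K^{1/4}


def eta(k: int) -> float:
    """Step size for episode k (0-indexed): eta = c * K^{1/4} / V_{g,max},
    with the coefficient c scaled down linearly from 100 to 1 over K episodes."""
    c = 100.0 + (1.0 - 100.0) * k / (K - 1)
    return c * K ** 0.25 / V_G_MAX


def bonus_r(n_visits: int) -> float:
    """Reward bonus Bon^k_{r,h}(s, a) = 0.5 * H * log(K) / N^k_h(s, a)."""
    return 0.5 * H * math.log(K) / max(n_visits, 1)


def bonus_g(n_visits: int) -> float:
    """Utility bonus Bon^k_{g,h}(s, a) = 0.005 * V_{g,max} * log(K) / N^k_h(s, a)."""
    return 0.005 * V_G_MAX * math.log(K) / max(n_visits, 1)
```

Under this schedule the step size starts large (c = 100 in the first episode) and shrinks to c = 1 by the last episode, matching the report's note that a larger η is used in earlier episodes, and both bonuses decay as the visit count N^k_h(s, a) grows.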