An Optimistic Algorithm for Online CMDPs with Anytime Adversarial Constraints
Authors: Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, Honghao Wei
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our algorithm in a synthetic, finite-horizon CMDP environment constructed to assess performance under both stochastic and adversarial cost settings. We plot the cumulative constraint violation across learning episodes (K = 3000), where both the stochastic and adversarial curves clearly demonstrate the algorithm's ability to ensure sublinear violation growth. |
| Researcher Affiliation | Academia | 1) School of Electrical Engineering & Computer Science, Pullman, USA; 2) Department of Industrial & Systems Engineering, KAIST, Daejeon, South Korea; 3) School of Information Science & Technology, ShanghaiTech University, Shanghai, China. |
| Pseudocode | Yes | Algorithm 1 OMDPD |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of source code for the described methodology. |
| Open Datasets | No | The paper uses a "synthetic and finite-horizon CMDP environment" for simulation. It does not mention using or providing access to any publicly available dataset. |
| Dataset Splits | No | The paper describes the parameters of a synthetic CMDP environment used for simulation (e.g., state space, action space, horizon, reward/cost sampling), but it does not specify any training/test/validation dataset splits, as it's a simulated environment rather than a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory specifications) used for running the simulations or experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies, libraries, or solvers with version numbers. |
| Experiment Setup | Yes | We evaluate our algorithm in a synthetic, finite-horizon CMDP environment constructed to assess performance under both stochastic and adversarial cost settings. The CMDP consists of a state space S = {0, 1, 2, 3, 4} with five discrete states and an action space A = {0, 1, 2} with three available actions. The decision process unfolds over a fixed horizon of H = 5 steps. At each time step, the agent receives a reward r ∈ [0, 1]^(H×S×A) sampled uniformly from the unit interval. In the stochastic setting, the cost c ∈ [-1, 1]^(H×S×A) is also drawn uniformly and held fixed across episodes. In contrast, the adversarial setting introduces a discrete cost perturbation mechanism: in each episode, the cost is independently sampled from the finite set {-1.0, -0.6, -0.2, 0.0, 0.2, 0.6, 1.0}, simulating abrupt shifts in constraint feedback. The transition dynamics are time-dependent, with each transition distribution P_{h,s,a} independently sampled from a Dirichlet distribution with concentration parameter α = 0.5. The initial state is sampled uniformly, ensuring that each trajectory starts from a randomly selected state. Throughout all experiments, the cumulative cost constraint threshold is 0. We plot the cumulative constraint violation across learning episodes (K = 3000). |
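The reported setup can be sketched as a short NumPy snippet. This is a minimal illustration following only the stated parameters (|S| = 5, |A| = 3, H = 5, Dirichlet α = 0.5, K = 3000, the discrete adversarial cost set, and the uniform reward/cost/initial-state sampling); the function names and random seed are our own and not from the paper.

```python
import numpy as np

# Parameters as reported in the paper's experiment setup.
S, A, H = 5, 3, 5          # 5 states, 3 actions, horizon H = 5
K = 3000                   # number of learning episodes
ADV_COSTS = np.array([-1.0, -0.6, -0.2, 0.0, 0.2, 0.6, 1.0])

rng = np.random.default_rng(0)  # seed is illustrative, not from the paper

def make_cmdp(rng):
    """Sample the fixed components of the synthetic CMDP (hypothetical helper)."""
    # Rewards r in [0, 1]^(H x S x A), drawn uniformly and held fixed.
    r = rng.uniform(0.0, 1.0, size=(H, S, A))
    # Time-dependent transitions: each P_{h,s,a} ~ Dirichlet(alpha = 0.5).
    P = rng.dirichlet(np.full(S, 0.5), size=(H, S, A))  # shape (H, S, A, S)
    return r, P

def stochastic_costs(rng):
    # Stochastic setting: c in [-1, 1]^(H x S x A), fixed across episodes.
    return rng.uniform(-1.0, 1.0, size=(H, S, A))

def adversarial_costs(rng):
    # Adversarial setting: costs drawn i.i.d. per episode from the finite set.
    return rng.choice(ADV_COSTS, size=(H, S, A))

r, P = make_cmdp(rng)
c_stoch = stochastic_costs(rng)
c_adv = adversarial_costs(rng)  # resampled at each episode k = 1, ..., K
s0 = rng.integers(S)            # uniformly random initial state
```

Note the transition tensor has shape (H, S, A, S), so `P[h, s, a]` is the next-state distribution at step h, with rows summing to 1 by construction of the Dirichlet draw.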