Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

Authors: Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We study online learning in constrained MDPs (CMDPs), focusing on the goal of attaining sublinear strong regret and strong cumulative constraint violation. ... We answer such a question affirmatively, by providing an efficient policy optimization algorithm with e O(T) strong regret/violation. Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial (unconstrained) MDPs as primal algorithm, and a UCB-like update for dual variables. ... The paper primarily focuses on presenting a novel algorithm, its theoretical properties, and bounds (e.g., e O(T) strong regret/violation), and includes proofs (Lemma 3, 4, 5, Theorem 1, 2). There are no sections describing experimental results, data analysis, or comparisons against baselines on datasets.
Researcher Affiliation Academia Francesco Emanuele Stradi Politecnico di Milano EMAIL, Matteo Castiglioni Politecnico di Milano EMAIL, Alberto Marchesi Politecnico di Milano EMAIL, Nicola Gatti Politecnico di Milano EMAIL
Pseudocode Yes Algorithm 1 Learner-Environment Interaction... Algorithm 2 Constrained Primal-Dual Policy Optimization (CPD-PO)
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository or mention code in supplementary materials.
Open Datasets No The paper is theoretical in nature, focusing on algorithm design and proving theoretical bounds in Constrained Markov Decision Processes (CMDPs). It does not describe any empirical evaluations that would involve the use of specific datasets.
Dataset Splits No The paper is theoretical and does not describe experiments using specific datasets. Therefore, there is no mention of training, validation, or test dataset splits.
Hardware Specification No The paper is theoretical and focuses on algorithm design and theoretical analysis. It does not describe any experimental setup or mention specific hardware (like GPUs, CPUs, or TPUs) used for running experiments.
Software Dependencies No The paper is theoretical and focuses on algorithm design and analysis rather than implementation and experimentation. It does not mention any specific software libraries, frameworks, or their versions that would be needed to replicate experimental results.
Experiment Setup No The paper is theoretical and focuses on the design and analysis of an algorithm for online learning in CMDPs. It does not describe any empirical experiments, and therefore, does not include details such as hyperparameter values, training configurations, or system-level settings for an experimental setup.