Online Episodic Convex Reinforcement Learning
Authors: Bianca Marin Moreno, Khaled Eldowa, Pierre Gaillard, Margaux Brégère, Nadia Oudjane
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Bonus O-MD-CURL on the multi-objective and constrained MDP tasks from (Geist et al., 2022), which use fixed objective functions and fixed probability kernels across time steps. ... These examples empirically demonstrate the value of the additive bonus in tasks requiring exploration. |
| Researcher Affiliation | Collaboration | ¹Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. ²EDF Lab, 7 bd Gaspard Monge, 91120 Palaiseau, France. ³FiME (Laboratoire de Finance des Marchés de l'Energie Dauphine, CREST, EDF R&D). ⁴Università degli Studi di Milano, Milan, Italy. ⁵Politecnico di Milano, Milan, Italy. ⁶Sorbonne Université, LPSM, Paris, France. |
| Pseudocode | Yes | Algorithm 1 Bonus O-MD-CURL (Full-information) |
| Open Source Code | Yes | The code to reproduce the empirical results is available at: https://github.com/biancammoreno/Convex_RL |
| Open Datasets | Yes | We evaluate Bonus O-MD-CURL on the multi-objective and constrained MDP tasks from (Geist et al., 2022), which use fixed objective functions and fixed probability kernels across time steps. |
| Dataset Splits | No | The paper describes the environment setup (e.g., "11 × 11 four-room grid world") and task parameters ("N = 40, τ = 0.01, and 5 repetitions per experiment") but does not provide the training/validation/test dataset splits typically found in supervised learning. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models) used for running the experiments. It only describes the simulation environment and experimental parameters. |
| Software Dependencies | No | The paper mentions that code is available on GitHub but does not specify any software libraries, frameworks, or their version numbers used in the implementation. |
| Experiment Setup | Yes | The state space is an 11 × 11 four-room grid world, with a single door connecting adjacent rooms. The agent can choose to stay still or move right, left, up, or down... The initial distribution is a Dirac delta at the upper left corner of the grid, as in Fig. 1 [left]. We take N = 40, τ = 0.01, and 5 repetitions per experiment. Multi-objectives: The goal is to concentrate the distribution on three targets by the final step N, as in Fig. 1 [middle]. The objective function is defined as $f_n(\mu_n^{\pi,p}) := \sum_{k=1}^{3} (1 - \langle \mu_n^{\pi,p}, e_k \rangle)^2$... Constrained MDPs: The goal is to concentrate the state distribution on the yellow target in Fig. 1 [right] while avoiding the constraint states in blue. The objective function is defined as $f_n(\mu_n^{\pi,p}) := -\langle r, \mu_n^{\pi,p} \rangle + (\langle \mu_n^{\pi,p}, c \rangle)^2$... |
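
The two objectives quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The target cells, the reward vector `r`, the constraint vector `c`, and all helper names are hypothetical, and the sign convention (a loss to be minimized) is an assumption.

```python
GRID = 11                 # 11 x 11 grid world, as in the setup row
N_STATES = GRID * GRID


def indicator(cells):
    """Vector over flattened grid states with 1.0 at each (row, col) in cells."""
    v = [0.0] * N_STATES
    for row, col in cells:
        v[row * GRID + col] = 1.0
    return v


def dot(u, v):
    """Inner product <u, v> of two state vectors."""
    return sum(a * b for a, b in zip(u, v))


def multi_objective_loss(mu, targets):
    """Sum over targets e_k of (1 - <mu, e_k>)^2."""
    return sum((1.0 - dot(mu, e_k)) ** 2 for e_k in targets)


def constrained_loss(mu, r, c):
    """-<r, mu> + (<mu, c>)^2, under the assumed minimization convention."""
    return -dot(r, mu) + dot(mu, c) ** 2


# Hypothetical target and constraint cells, chosen only for illustration.
targets = [indicator([(2, 8)]), indicator([(8, 2)]), indicator([(8, 8)])]
r = indicator([(8, 8)])   # reward on a single target cell
c = indicator([(5, 8)])   # a single constraint cell

# Initial distribution: Dirac delta at the upper-left corner of the grid.
mu0 = indicator([(0, 0)])
```

With the mass concentrated away from every target, `multi_objective_loss(mu0, targets)` takes its largest value for a Dirac distribution, and `constrained_loss` drops as mass moves onto the rewarded cell and rises as mass moves onto the constraint cell.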