Constrained episodic reinforcement learning in concave-convex and knapsack settings
Authors: Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, Wen Sun
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in constrained episodic benchmarks. |
| Researcher Affiliation | Collaboration | Kianté Brantley University of Maryland EMAIL; Miroslav Dudík Microsoft Research EMAIL; Thodoris Lykouris Microsoft Research EMAIL; Sobhan Miryoosefi Princeton University EMAIL; Max Simchowitz UC Berkeley EMAIL; Aleksandrs Slivkins Microsoft Research EMAIL; Wen Sun Cornell University EMAIL |
| Pseudocode | No | The paper describes algorithms and their components (e.g., CONRL, CONPLANNER) and how to solve optimization problems as linear programs, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/miryoosefi/ConRL |
| Open Datasets | Yes | We run our experiments on two grid-world environments Mars rover (Tessler et al., 2019) and Box (Leike et al., 2017). |
| Dataset Splits | No | The paper describes running experiments on grid-world environments and training over a number of trajectories, but it does not specify traditional dataset splits (e.g., training, validation, test percentages or counts) as commonly seen in supervised learning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | The episode horizon H is 30 and the agent's action is perturbed with probability 0.1 to a random action. APPROPO focuses on the feasibility problem, so it requires specifying a lower bound on the reward, which we set to 0.3 for Mars rover and 0.1 for Box. |
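The reported setup values can be collected into a minimal configuration sketch. This is illustrative only: the dictionary keys below are hypothetical and do not come from the authors' ConRL repository.

```python
# Hypothetical experiment configuration assembled from the values reported
# in the paper; key names are illustrative, not taken from the ConRL code.
EXPERIMENT_CONFIG = {
    "episode_horizon": 30,             # H = 30
    "action_perturbation_prob": 0.1,   # action replaced by a random one w.p. 0.1
    # APPROPO solves a feasibility problem, so it needs a per-environment
    # lower bound on the reward:
    "reward_lower_bound": {
        "mars_rover": 0.3,
        "box": 0.1,
    },
}
```

Such a sketch makes it easy to check that all parameters needed to rerun the experiments are actually reported in the paper.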