Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory
Authors: Karthik Somayaji NS, Yu Wang, Malachi Schram, Jan Drgona, Mahantesh M Halappanavar, Frank Liu, Peng Li
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations show that the proposed method outperforms other risk-averse RL algorithms on a diverse range of benchmark tasks, each encompassing distinct risk scenarios. |
| Researcher Affiliation | Academia | Karthik Somayaji NS EMAIL Department of Electrical and Computer Engineering, University of California Santa Barbara; Yu Wang EMAIL Department of Electrical and Computer Engineering, University of California Santa Barbara; Malachi Schram EMAIL Thomas Jefferson National Accelerator Laboratory; Jan Drgona EMAIL Pacific Northwest National Laboratory; Mahantesh Halappanavar EMAIL Pacific Northwest National Laboratory; Frank Liu EMAIL School of Data Science, Old Dominion University; Peng Li EMAIL Department of Electrical and Computer Engineering, University of California Santa Barbara |
| Pseudocode | Yes | Section 7.4, Algorithm for EVAC; Algorithm 1: Extreme Valued Actor Critic (EVAC) |
| Open Source Code | No | The paper uses well-known open-source environments such as Mujoco and Safety-gym, and cites them. However, it does not provide any explicit statement or link to the source code for the authors' own method (EVAC) described in this paper. |
| Open Datasets | Yes | We use the Half-Cheetah environment (Brockman et al., 2016) for our demonstration. We experiment on two benchmark OpenAI environments (Brockman et al., 2016), namely Mujoco environments and Safety-gym environments (Ji et al., 2023). We employ mobile-env (Schneider et al., 2022), an open-source environment that simulates the connections and QoE between several base stations and cell phone users. |
| Dataset Splits | No | The paper describes reinforcement learning environments where data is generated through agent interaction. It specifies training duration (100,000 time steps) and evaluation procedure (inference on 5 trained agents, each completing an episode) but does not provide explicit training/test/validation dataset splits in the conventional supervised learning sense. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions using Python and various environments (Mujoco, Safety-gym, mobile-env) but does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | During training and inference, the max episode length of the agent is set to 1000. During training, the agents were trained for 100,000 time steps in total. The batch size B = 128, and we set K, the number of samples drawn from the GPD distribution, to 50. We set the learning rates for the actor and critic to 0.001 in all cases. The discount factor γ = 0.99 in all cases as well. The soft update parameter τ = 0.02 for all our experiments on Hopper and Walker2d, while τ = 0.01 for the Half-Cheetah environment. Both the actor and critic have 3 layers with hidden size 128. |
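
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch for anyone attempting replication. The dictionary below, its key names, and the GPD sampler are illustrative assumptions, not the authors' code: the sampler is a generic inverse-transform draw from a Generalized Pareto Distribution, with shape `xi` and scale `sigma` chosen arbitrarily since the paper excerpt does not report them.

```python
import math
import random

# Hyperparameters as reported in the paper's experiment setup; the key
# names here are illustrative, not taken from the authors' code.
EVAC_CONFIG = {
    "max_episode_length": 1000,
    "total_training_steps": 100_000,
    "batch_size": 128,           # B
    "gpd_samples": 50,           # K, samples drawn from the GPD tail model
    "actor_lr": 1e-3,
    "critic_lr": 1e-3,
    "discount_gamma": 0.99,
    "soft_update_tau": {"Hopper": 0.02, "Walker2d": 0.02, "Half-Cheetah": 0.01},
    "hidden_layers": 3,
    "hidden_size": 128,
}

def sample_gpd(k, xi=0.3, sigma=1.0, seed=0):
    """Draw k samples from a Generalized Pareto Distribution via
    inverse-transform sampling: Q(u) = (sigma / xi) * ((1 - u)**(-xi) - 1).
    xi and sigma are placeholder values, not figures from the paper."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        u = rng.random()
        if abs(xi) < 1e-12:
            # Exponential limit of the GPD as xi -> 0.
            out.append(-sigma * math.log(1.0 - u))
        else:
            out.append(sigma / xi * ((1.0 - u) ** (-xi) - 1.0))
    return out

samples = sample_gpd(EVAC_CONFIG["gpd_samples"])
```

For xi > 0 and sigma > 0 every draw is non-negative, matching the GPD's support on [0, ∞), which is why EVT-based methods use it to model the tail of the return distribution beyond a threshold.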