Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning
Authors: Xiaoteng Ma, Shuai Ma, Li Xia, Qianchuan Zhao
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct diverse experiments from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods. |
| Researcher Affiliation | Academia | Xiaoteng Ma (EMAIL), Department of Automation, Tsinghua University, Beijing, 100086, P. R. China; Shuai Ma (EMAIL), School of Business, Sun Yat-sen University, Guangzhou, 510275, P. R. China; Li Xia (EMAIL, corresponding author), School of Business, Sun Yat-sen University, Guangzhou, 510275, P. R. China; Qianchuan Zhao (EMAIL), Department of Automation, Tsinghua University, Beijing, 100086, P. R. China |
| Pseudocode | Yes | Algorithm 1: The framework of MSV optimization; Algorithm 2: MSVAC; Algorithm 3: MSVPO |
| Open Source Code | No | The paper does not contain any explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | Finally, we conduct diverse experiments from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods. |
| Dataset Splits | No | The paper describes environments used for experiments (e.g., MuJoCo's Walker2d) but does not provide explicit training/test/validation dataset splits. Instead, it describes an experimental protocol for these environments. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions MuJoCo and OpenAI Gym, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Table 3: Hyper-parameters of MSVPO. Network learning rate β: 3e-4; Network hidden sizes: [64, 64]; Activation function: Tanh; Optimizer: Adam; Batch size: 256; Gradient clipping: 10; Clipping parameter ε: 0.2; Optimization epochs M: 10; GAE parameter λ: 0.95; Average value constraint coefficient in APO (Ma et al., 2021) ν: 0.3 |
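
Since no source code was released, the Table 3 hyper-parameters quoted above could be collected into a configuration object along the following lines. This is only a sketch: the class name `MSVPOConfig` and all field names are hypothetical and do not come from the paper; only the values are taken from the table. The table does not say whether gradient clipping is applied to the norm or to individual values, so the field is left neutral.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MSVPOConfig:
    """Values from Table 3 of the paper; all names here are hypothetical."""

    learning_rate: float = 3e-4              # network learning rate (beta)
    hidden_sizes: Tuple[int, ...] = (64, 64)  # network hidden sizes
    activation: str = "tanh"                 # activation function
    optimizer: str = "adam"                  # optimizer
    batch_size: int = 256                    # batch size
    grad_clip: float = 10.0                  # gradient clipping (norm vs. value not stated)
    clip_epsilon: float = 0.2                # clipping parameter (epsilon)
    optimization_epochs: int = 10            # optimization epochs (M)
    gae_lambda: float = 0.95                 # GAE parameter (lambda)
    avg_value_constraint_coef: float = 0.3   # APO average value constraint coefficient (nu)


config = MSVPOConfig()
```

Freezing the dataclass keeps the reported values from being modified by accident during a run; a variant for an ablation can be derived with `dataclasses.replace(config, clip_epsilon=0.1)`.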