Risk‑Seeking Reinforcement Learning via Multi‑Timescale EVaR Optimization
Authors: Deep Kumar Ganguly, Ajin George Joseph, Sarthak Girotra, Sirish Sekhar
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze the asymptotic behavior of our proposed algorithm and rigorously evaluate it across various discrete and continuous benchmark environments. The results highlight that the EVaR policy achieves higher cumulative returns and corroborate that EVaR is indeed a competitive risk-seeking objective for RL. We evaluate our method on both discrete and continuous-control benchmarks. For each environment, we report environment-specific indicators including mean return, tail-risk metrics, dispersion across random seeds, and learning-curve behaviour. We also conduct selective ablation studies on stepsize and perturbation schedules to isolate their effects. |
| Researcher Affiliation | Academia | Deep Ganguly, Sarthak Girotra, Sirish Sekhar, and Ajin George Joseph, all with the Department of Computer Science and Engineering, Indian Institute of Technology Tirupati |
| Pseudocode | Yes | Algorithm 1: Multi-timescale EVaR optimization; Algorithm 2: EVaR Optimization using Disciplined Convex Cone |
| Open Source Code | No | The paper mentions third-party open-source libraries that were used, including Simglucose, OpenAI Gym, Riskfolio-Lib, and Stable-Baselines3. It also states: "Complete implementation details, hyperparameters, and reproducibility artefacts are provided in D and C." and Appendix D states: "The supplementary material provided includes all the experiments with their obtained values, which are reported here in a visual format." However, the authors neither explicitly state that they release the source code for the methodology described in this paper nor provide a direct link to a code repository. |
| Open Datasets | Yes | We consider the OpenAI Gym environments Inverted-Double-Pendulum/v4 and Swimmer/v4 from the MuJoCo framework (Tassa et al., 2018) and Mountain-Car-Continuous/v0 from the Box2D Gym framework (Towers et al., 2023). We demonstrate our algorithm's ability to manage high-risk insulin administration for Type-1 Diabetes Mellitus (T1DM) using the Simglucose simulator (Xie, 2018). The portfolio optimization problem seeks an optimal allocation among N assets by maximizing the EVaR of the portfolio returns R, which captures the upside tail of the return distribution. Here, the policy represents the action chosen, which can be sell, buy, or hold. Constraints ensure that the portfolio weights w_i are nonnegative and sum to one, representing a fully invested portfolio. Our portfolio consists of the top 10 stocks of the DJIA. |
| Dataset Splits | No | Across 5,000 evaluation episodes and 200 distinct obstacle layouts we observe a markedly heavy-tailed distribution. All methods operate in a fully controlled tabular setting with identical finite-horizon MDPs, tabular state-action value tables initialized to zero, ϵ-greedy exploration (ϵ = 0.1), discount factor γ = 0.99, and a fixed learning rate of 0.1. By limiting all algorithms to 500 episodes per seed (truncated at 200 steps) and averaging over eight independent random seeds, we ensure that any performance differential arises exclusively from the choice of risk criterion and its estimator, rather than from architectural capacity or extensive hyperparameter tuning. |
| Hardware Specification | Yes | The experiments were conducted on an NVIDIA DGX A100 with an AMD EPYC 7742 64-core processor operating at 1.5 GHz (up to 3.39 GHz boost), 32 GB of GDDR5 RAM, and an NVIDIA A100-SXM4-40GB GPU at 1.41 GHz with memory clocked at 1.21 GHz. |
| Software Dependencies | Yes | PyTorch 1.13.1 with CUDA 11.6, running on Python 3.10.13. |
| Experiment Setup | Yes | Complete implementation details, hyperparameters, and reproducibility artefacts are provided in D and C. Table 5: Hyperparameters for Finite Difference Gradient Estimation (lists Timeout, Iterations, Learning Rate Decay, Learning Rate Power, Perturbation Size, Perturbation Decay, Perturbation Power, Momentum, Beta (Adam Parameter), Epsilon (Adam Parameter)). Table 6: Hyperparameters used in experiments (lists Learning rate, Constant A, Constant c, Random noise parameter δ, Action sampling, Step size δ, Step size ξ). Table 9: Hyperparameters used for training (lists Learning rate, Buffer size, Batch size, Gamma, Train frequency, Gradient steps, Entropy coefficient (ent coef), Target entropy, Tau, Policy kwargs). |
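The risk-seeking objective quoted above maximizes the EVaR of returns, which upper-bounds VaR and CVaR and tracks the upside tail. As a minimal illustration of the quantity involved (not the paper's implementation), the textbook sample-based EVaR of the upper tail can be estimated with a one-dimensional search over the dual variable z; the grid search and function name here are purely illustrative:

```python
import math

def evar_upper(samples, alpha=0.1, z_grid=None):
    """Sample-based EVaR of the upper tail:
        EVaR_{1-alpha}(X) = inf_{z > 0} (1/z) * [log E[exp(z X)] - log(alpha)]
    estimated by a coarse grid search over z (illustrative only)."""
    n = len(samples)
    if z_grid is None:
        z_grid = [10 ** (k / 10) for k in range(-30, 21)]  # z from 1e-3 to 1e2
    best = float("inf")
    for z in z_grid:
        # log-sum-exp trick keeps exp(z * x) from overflowing
        m = max(z * x for x in samples)
        lse = m + math.log(sum(math.exp(z * x - m) for x in samples) / n)
        best = min(best, (lse - math.log(alpha)) / z)
    return best
```

By Jensen's inequality the estimate never falls below the sample mean, and as z grows it approaches the sample maximum, which is why maximizing it rewards policies with heavy upper tails.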
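The perturbation-size, perturbation-decay, and learning-rate-decay hyperparameters in Table 5 are the knobs of a finite-difference (simultaneous-perturbation) gradient estimator. A generic SPSA ascent step of that family, with illustrative constants and function names that are not the paper's, can be sketched as:

```python
import random

def spsa_step(theta, objective, a=0.1, c=0.1, k=1,
              alpha=0.602, gamma=0.101, rng=random):
    """One SPSA update that ascends `objective` using only two evaluations.
    a_k and c_k follow the standard decaying schedules; all constants are
    illustrative defaults, not the paper's tuned values."""
    a_k = a / (k + 1) ** alpha   # decaying learning rate
    c_k = c / (k + 1) ** gamma   # decaying perturbation size
    # Rademacher perturbation directions (+1 or -1 per coordinate)
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c_k * d for t, d in zip(theta, delta)]
    minus = [t - c_k * d for t, d in zip(theta, delta)]
    g_hat = (objective(plus) - objective(minus)) / (2.0 * c_k)
    # ascend along the estimated gradient (1/delta_i == delta_i for +/-1)
    return [t + a_k * g_hat * d for t, d in zip(theta, delta)]
```

The appeal in this setting is that the objective (e.g. an EVaR estimate of returns) is only queried as a black box, so no analytic gradient of the risk measure is needed.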