Model Risk-sensitive Offline Reinforcement Learning

Authors: Gwangpyo Yoo, Honguk Woo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments in finance and self-driving scenarios demonstrate that the proposed framework significantly reduces risk, by 11.2% to 18.5%, compared to the best-performing risk-sensitive offline RL baseline, particularly in highly uncertain environments. We evaluate MR-IQN on finance and self-driving scenarios, comparing against the baselines above. The label Mean denotes the mean score averaged over seeds, and Hϕ(Zπ) denotes the mean negative risk over seeds. All reported scores are averaged across 5 seeds. Numerical comparisons stated without an explicit criterion refer to negative risk. We also present the D4RL results, comparing the baselines and 1R2R, in Appendix A.1. The results and analysis for CV@R(10%) risks are in Appendix A.4, and the dataset details are provided in Appendix B.
Researcher Affiliation | Academia | Gwangpyo Yoo, Honguk Woo, Department of Computer Science and Engineering, Sungkyunkwan University. EMAIL
Pseudocode | Yes | Algorithm 1: Critic-Ensemble Model Risk
Open Source Code | No | The paper references third-party tools and baseline implementations with links, such as 'Haghpanah Mohammad Amin. gym-mtsim. https://github.com/AminHP/gym-mtsim, 2021.' and 'Marc Rigter, Bruno Lacerda, and Nick Hawes. One risk to rule them all: official code, 2023. URL https://github.com/marc-rigter/1R2R.'. However, there is no explicit statement or link providing open-source code for the methodology described in this paper (MR-IQN).
Open Datasets | Yes | The trading environment is implemented using MT-sim (Amin, 2021). The data ranges from February 3rd to December 1st, 2023. [...] The forex data is collected from Meta Trader 4 (Sajedi, 2024), and the stock data from Yahoo Finance (Perlin, 2023). [...] Since D4RL is a standard benchmark for offline RL training, we report the D4RL scores here.
Dataset Splits | No | The paper states: 'We evaluate 1000 episodes for each seed to calculate CV@R and Wang negative risk.' for finance scenarios and 'We evaluate performance in 100 episodes for each seed to measure negative risk.' for self-driving scenarios. While it mentions D4RL, it does not provide specific training/test/validation splits, or references to standard D4RL splits, for reproduction. For the custom finance and self-driving environments, evaluation is described in terms of episodes, but no explicit training/test/validation splits are detailed.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU models, memory) used for running the experiments.
Software Dependencies | Yes | Marcelo Perlin. Yahoo finance, 2023. URL https://github.com/ropensci/yfR. Python package version 0.2.38. Reza Sajedi. Meta trader5, 2024. URL https://www.metatrader5.com/en/trading-platform. Python package version 5.0.4200. Table D also lists 'Pytorch' as a hyperparameter setting, implying its use.
Experiment Setup | Yes | Table D: Hyperparameter settings lists specific values for the learning rate, optimizer (Adam (β1 = 0.9, β2 = 0.999)), discount factor (γ), batch size, number of critics, number of quantiles, soft update ratio (τ), IQN parameters, Fourier feature parameters, TQC dropout, and policy delay. The text 'Table D lists up the detailed hyperparameters of the experiments' also supports this.
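The CV@R(10%) metric referenced throughout the evaluation rows above can be illustrated with a minimal empirical estimator. This is a generic sketch of conditional value-at-risk over episode returns, not the paper's implementation; the function name `empirical_cvar` and the simulated returns are hypothetical, while the 10% level and the 1000-episode evaluation size mirror the report.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.10):
    """Empirical CV@R: the mean of the worst alpha-fraction of episode returns.

    For returns where higher is better, CV@R(alpha) averages the lowest
    alpha-quantile of the return distribution, so a higher CV@R means
    lower tail risk.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the worst tail
    return returns[:k].mean()

# Example: 1000 simulated episode returns, matching the per-seed episode
# count quoted for the finance scenarios (values here are synthetic).
rng = np.random.default_rng(0)
episode_returns = rng.normal(loc=1.0, scale=0.5, size=1000)
print(empirical_cvar(episode_returns, alpha=0.10))
```

Averaging such a per-seed estimate over 5 seeds, as the report describes, yields the tabulated risk scores; the sign convention (negative risk) only flips which direction counts as better.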