A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents

Authors: Kaiwen Wang, Dawen Liang, Nathan Kallus, Wen Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Simulation Experiments. We describe a numerical simulation to demonstrate the importance of learning history-dependent policies for OCE RL and to empirically evaluate our algorithms. Our code can be found at https://github.com/kaiwenw/oce-rl. Setting up synthetic MDP. The proof-of-concept MDP is shown in Figure 1 and has two states. ... Experiment with tabular policies. ... Experiment with neural network policies. ... We plot the learning curves in Figure 2...
Researcher Affiliation | Collaboration | ¹Cornell Tech, ²Netflix Research. Correspondence to: Kaiwen Wang <kaiwenw.github.io>. Work done as a Netflix intern.
Pseudocode | Yes | Algorithm 1: Meta-algorithm for optimistic oracles. 1: Input: number of rounds K, optimistic oracle OPTALG satisfying Def. 3.1. 2: for round k = 1, 2, ..., K do 3: Query OPTALG in Aug. MDP for value func. V̂_{1,k}(·).
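Only the first three lines of Algorithm 1 are quoted above, but the shape of the outer loop is clear. The sketch below is a hypothetical Python skeleton of that loop; the oracle interface and the names `opt_alg` and `collect_episode` are our assumptions, since the remaining steps of the algorithm are not shown in the excerpt.

```python
def meta_algorithm(K, opt_alg, collect_episode):
    """Hypothetical skeleton of Algorithm 1's outer loop.

    Each round k queries the optimistic oracle for a value
    estimate V-hat_{1,k} in the augmented MDP, then (an assumed
    step, not quoted in the excerpt) acts on it to collect data.
    """
    history = []
    for k in range(1, K + 1):
        v_hat = opt_alg(history)          # optimistic value func. for round k
        episode = collect_episode(v_hat)  # roll out in the Aug. MDP (assumed)
        history.append(episode)
    return history
```

The loop returns the collected history so a caller can inspect all K rounds; how the oracle consumes that history is left abstract, matching the quoted pseudocode.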
Open Source Code | Yes | Our code can be found at https://github.com/kaiwenw/oce-rl.
Open Datasets | No | Setting up synthetic MDP. The proof-of-concept MDP is shown in Figure 1 and has two states. At s1, all actions yield a random reward r1 ~ Ber(0.5) and transition to s2. At s2, the first action a1 gives a random reward r2 | s2, a1 ~ 1.5 · Ber(0.75), while the other action a2 gives a deterministic reward r2 | s2, a2 = 0.5. The trajectory ends after s2. (The paper defines the MDP for simulation but does not provide access information for a publicly available dataset.)
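The two-state MDP quoted above is fully specified, so it is easy to reproduce. The sketch below is our own illustrative implementation, not taken from the paper's released code; the class name, action labels, and step interface are assumptions.

```python
import random

class TwoStateMDP:
    """Sketch of the paper's proof-of-concept MDP (Figure 1).

    At s1, every action yields r1 ~ Ber(0.5) and moves to s2.
    At s2, action a1 yields r2 ~ 1.5 * Ber(0.75), while action a2
    yields a deterministic r2 = 0.5. The trajectory then ends.
    """

    def reset(self):
        self.state = "s1"
        return self.state

    def step(self, action):
        """Return (next_state, reward, done)."""
        if self.state == "s1":
            reward = float(random.random() < 0.5)  # r1 ~ Ber(0.5)
            self.state = "s2"
            return self.state, reward, False
        # In s2, the episode terminates after this step.
        if action == "a1":
            reward = 1.5 * float(random.random() < 0.75)  # 1.5 * Ber(0.75)
        else:  # a2: deterministic reward
            reward = 0.5
        return None, reward, True
```

Note that a1 at s2 has a higher mean reward (1.5 × 0.75 = 1.125) but is risky, while a2 is safe, which is what makes the MDP a natural testbed for risk-sensitive objectives.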
Dataset Splits | No | The paper uses a synthetic MDP for its simulation experiments, an environment defined as a proof of concept rather than a dataset with traditional train/test/validation splits. The text mentions 'We repeat runs five times', which refers to experiment repetitions, not data partitioning for model training.
Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications.
Software Dependencies | No | The paper mentions deep RL oracles like PPO and REINFORCE, and the Adam optimizer, but does not specify their version numbers or any other software dependencies with exact versions.
Experiment Setup | Yes | C. More Details on Experimental Setup... Table 5. Hyperparameter settings used in our experiments: Policy Network: softmax policy, MLP with two hidden layers of dimension 64; Value Network: MLP with two hidden layers of dimension 64; Optimizer: Adam with β1 = 0.9, β2 = 0.999; Batch Size: 256; Learning Rate: 5e-3; PPO KL weight: 0.1; Regularization weight: 0.1.
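For reference, the Table 5 settings can be gathered into a single configuration object. The dict below reproduces the reported values; the key names and nesting are our own choices, not the paper's.

```python
# Hyperparameters as reported in Table 5 of the paper,
# collected into a plain config dict (structure is ours).
config = {
    "policy_network": {"type": "softmax_mlp", "hidden_dims": [64, 64]},
    "value_network": {"type": "mlp", "hidden_dims": [64, 64]},
    "optimizer": {"name": "adam", "beta1": 0.9, "beta2": 0.999},
    "batch_size": 256,
    "learning_rate": 5e-3,
    "ppo_kl_weight": 0.1,
    "regularization_weight": 0.1,
}
```

Keeping the settings in one dict like this makes it straightforward to log them alongside results when repeating runs.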