A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents
Authors: Kaiwen Wang, Dawen Liang, Nathan Kallus, Wen Sun
ICML 2025
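For context on the paper's titular risk measure: the optimized certainty equivalent (OCE) of a random return X under a concave utility u is the standard definition OCE_u(X) = sup_b { b + E[u(X - b)] }, and choosing u(t) = -(1/α)·max(-t, 0) recovers CVaR at level α. The grid-search sketch below is our own illustration, not the paper's implementation; the function `oce` and all names are assumptions.

```python
import numpy as np

def oce(values, probs, u, grid):
    """Optimized certainty equivalent OCE_u(X) = max_b { b + E[u(X - b)] },
    estimated by grid search over b for a discrete distribution."""
    values, probs = np.asarray(values), np.asarray(probs)
    return max(b + np.dot(probs, u(values - b)) for b in grid)

# CVaR at level alpha arises from the utility u(t) = -(1/alpha) * max(-t, 0).
alpha = 0.25
u_cvar = lambda t: -np.maximum(-t, 0.0) / alpha
grid = np.linspace(-1.0, 3.0, 2001)

# Hypothetical return distribution of the "always a1" policy in the paper's
# two-state MDP: r1 + r2 with r1 ~ Ber(0.5) and r2 ~ 1.5 * Ber(0.75).
vals = [0.0, 1.0, 1.5, 2.5]
ps = [0.125, 0.125, 0.375, 0.375]
print(oce(vals, ps, u_cvar, grid))       # CVaR_0.25 of the return
print(oce(vals, ps, lambda t: t, grid))  # identity utility recovers the mean
```

With the identity utility the objective is constant in b and the OCE collapses to the expectation, which is a quick sanity check on any OCE implementation.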
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Simulation Experiments): "We describe a numerical simulation to demonstrate the importance of learning history-dependent policies for OCE RL and to empirically evaluate our algorithms. Our code can be found at https://github.com/kaiwenw/oce-rl." The section's paragraphs cover: setting up the synthetic MDP ("The proof-of-concept MDP is shown in Figure 1 and has two states. ..."), an experiment with tabular policies, and an experiment with neural network policies ("We plot the learning curves in Figure 2..."). |
| Researcher Affiliation | Collaboration | ¹Cornell Tech, ²Netflix Research. "Correspondence to: Kaiwen Wang <kaiwenw.github.io>. Work done as Netflix intern." |
| Pseudocode | Yes | Algorithm 1 (Meta-algorithm for optimistic oracles): 1: Input: number of rounds K, optimistic oracle OptAlg satisfying Def. 3.1. 2: for round k = 1, 2, ..., K do 3: Query OptAlg in the augmented MDP for the value function V̂_{1,k}(·). |
| Open Source Code | Yes | Our code can be found at https://github.com/kaiwenw/oce-rl. |
| Open Datasets | No | Setting up synthetic MDP: "The proof-of-concept MDP is shown in Figure 1 and has two states. At s1, all actions lead to a random reward r1 ∼ Ber(0.5) and transition to s2. At s2, the first action a1 gives a random reward r2 \| s2, a1 ∼ 1.5 · Ber(0.75), while the other action a2 gives a deterministic reward r2 \| s2, a2 = 0.5. The trajectory ends after s2." (The paper defines the MDP for simulation but does not provide access information for a publicly available dataset.) |
| Dataset Splits | No | The paper uses a synthetic MDP for its simulation experiments: a proof-of-concept environment rather than a dataset with traditional train/validation/test splits. The text's mention of "We repeat runs five times" refers to experiment repetitions, not data partitioning for model training. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions deep RL oracles like PPO and REINFORCE, and the Adam optimizer, but does not specify their version numbers or any other software dependencies with their exact versions. |
| Experiment Setup | Yes | Appendix C (More Details on Experimental Setup), Table 5 (hyperparameter settings): Policy Network: softmax policy, MLP with two hidden layers of width 64; Value Network: MLP with two hidden layers of width 64; Optimizer: Adam (β1 = 0.9, β2 = 0.999); Batch Size: 256; Learning Rate: 5e-3; PPO KL weight: 0.1; Regularization Weight: 0.1. |
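The two-state MDP quoted in the "Open Datasets" row is simple enough to sketch as a simulator. The reconstruction below uses our own interface and naming (`rollout`, string state/action labels); only the reward distributions come from the paper, and the authoritative implementation is the released code at https://github.com/kaiwenw/oce-rl.

```python
import random

def rollout(policy, rng):
    """One episode of the paper's two-state proof-of-concept MDP (Figure 1).

    s1: every action yields r1 ~ Ber(0.5), then transitions to s2.
    s2: action 'a1' yields r2 ~ 1.5 * Ber(0.75); action 'a2' yields r2 = 0.5.
    `policy` maps a state label to an action label (interface is our own).
    """
    r1 = 1.0 if rng.random() < 0.5 else 0.0       # r1 ~ Ber(0.5)
    if policy("s2") == "a1":
        r2 = 1.5 if rng.random() < 0.75 else 0.0  # r2 ~ 1.5 * Ber(0.75)
    else:
        r2 = 0.5                                   # deterministic reward
    return r1 + r2

# Monte Carlo sanity check: E[return | always a1] = 0.5 + 1.5 * 0.75 = 1.625,
# while E[return | always a2] = 0.5 + 0.5 = 1.0.
rng = random.Random(0)
mean_a1 = sum(rollout(lambda s: "a1", rng) for _ in range(20000)) / 20000
print(mean_a1)  # ≈ 1.625
```

Note the trade-off this MDP encodes: a1 has the higher mean return but a random payoff, while a2 pays a smaller deterministic amount, which is exactly the kind of setting where a risk-sensitive OCE objective can prefer different behavior than the risk-neutral one.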