Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning
Authors: Taylor W. Killian, Sonali Parbhoo, Marzyeh Ghassemi
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%. Finally, we provide empirical evidence that our proposed framework enables an earlier determination of high-risk areas of the state space on both a simulated environment and a real application within healthcare of treating patients with sepsis. |
| Researcher Affiliation | Academia | Taylor W. Killian EMAIL University of Toronto, Vector Institute Massachusetts Institute of Technology; Sonali Parbhoo EMAIL Imperial College London; Marzyeh Ghassemi EMAIL Massachusetts Institute of Technology CIFAR AI Chair, Vector Institute |
| Pseudocode | No | The paper describes the DistDeD framework in prose and with a diagram in Figure 2, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code for data extraction and preprocessing as well as for defining and training DistDeD models can be found at https://github.com/MLforHealth/DistDeD. |
| Open Datasets | Yes | We use the MIMIC-IV (Medical Information Mart for Intensive Care; v2.0) database, sourced from the Beth Israel Deaconess Medical Center in Boston, Massachusetts (Johnson et al., 2020). This database contains deidentified treatment records of patients admitted to critical care units (CCU, CSRU, MICU, SICU, TSICU). The citation for MIMIC-IV is: Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2020). MIMIC-IV. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021), 2020. |
| Dataset Splits | Yes | All models are trained with 75% of the data (4,014 surviving patients, 627 patients who died), validated with 5% (268 survivors, 42 nonsurvivors), and we report all results on the remaining held-out 20% (1,070 survivors, 167 nonsurvivors). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments. |
| Software Dependencies | No | The paper mentions software like Ax, BoTorch, and the Adam optimizer, but does not provide specific version numbers for these or any other key software components used in the experimental setup. |
| Experiment Setup | Yes | For DeD, we model the QD and QR functions using the DDQN architecture (Hasselt et al., 2016) using two layers of 32 nodes with ReLU activations and a learning rate of 1e-3. For DistDeD we utilize IQN architectures (Dabney et al., 2018) for both ZD and ZR using two layers of 32 nodes, ReLU activations, and the same learning rate of 1e-3. For each IQN model, we sample N = N′ = 8 particles from the local and target τ distributions while training and also weight the CQL penalty β = 0.1. When evaluating ZD and ZR, we select K = 1000 particles and set our confidence level to α = 0.1. Additionally, Appendix A.2.1 states: For the encoding neural network, we used 2 layers with 80 hidden units in each with ReLU activations. The output dimension of this encoding network was 55... For optimization, the best learning rate was 5e-4 over 30 epochs. Appendix A.2.2 further states: For the IQN, the projection neural network accepted a 55-dimensional input (from the NCDE), consisted of 2 layers with 16 hidden units in each, using ReLU activations. The number of samples K drawn each optimization step was set to 64. The target network parameters were updated after every 5 optimization steps using an exponentially-weighted moving average with parameter τ set to 0.005. By construction, the discount rate γ is set to 1. For the weighting of the CQL penalty, β = 0.035. For optimization, we used Adam (Kingma & Ba, 2014) with the best performing learning rate found to be 2e-5 over 75 epochs of training. |
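The evaluation setting quoted above (K = 1000 sampled return particles, confidence level α = 0.1) can be illustrated with a small sketch. This is not the paper's implementation: the lower-tail CVaR risk measure, the `flag_dead_end` helper, and the thresholds `delta_d` / `delta_r` are illustrative assumptions standing in for DistDeD's actual risk criterion over the ZD and ZR return distributions.

```python
import numpy as np

def cvar(particles, alpha=0.1):
    """Lower-tail conditional value-at-risk: the mean of the worst
    alpha-fraction of sampled return particles."""
    particles = np.sort(np.asarray(particles, dtype=float))
    k = max(1, int(np.ceil(alpha * len(particles))))  # number of tail particles
    return particles[:k].mean()

def flag_dead_end(z_d_particles, z_r_particles,
                  delta_d=-0.5, delta_r=-0.5, alpha=0.1):
    """Flag a (state, action) pair as high-risk when the alpha-CVaR of
    either return distribution falls below its threshold. The thresholds
    here are hypothetical, not the paper's values."""
    return bool(cvar(z_d_particles, alpha) < delta_d
                or cvar(z_r_particles, alpha) < delta_r)

rng = np.random.default_rng(0)
# K = 1000 particles per distribution, alpha = 0.1, as in the quoted setup.
z_d = rng.uniform(-1.0, 0.0, size=1000)  # hypothetical sampled returns for ZD
z_r = rng.uniform(-1.0, 0.0, size=1000)  # hypothetical sampled returns for ZR
print(flag_dead_end(z_d, z_r, alpha=0.1))
```

Using a tail risk measure rather than the mean is what makes the approach "distributional": a state whose expected return looks acceptable can still be flagged when the worst 10% of sampled outcomes are poor, which is consistent with the paper's claim of earlier detection.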