SAFER: A Calibrated Risk-Aware Multimodal Recommendation Model for Dynamic Treatment Regimes
Authors: Yishan Shen, Yuyang Ye, Hui Xiong, Yong Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. |
| Researcher Affiliation | Academia | ¹University of Pennsylvania, ²Rutgers University, ³The Hong Kong University of Science and Technology (Guangzhou). Correspondence to: Hui Xiong <EMAIL>, Yong Chen <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in narrative text and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and dataset are available at https://github.com/yishanssss/SAFER. |
| Open Datasets | Yes | Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. For this study, we define cohorts based on the Sepsis-3 criteria (Singer et al., 2016), focusing on the early stages of sepsis management 24 hours prior to and 48 hours after sepsis onset. The treatment selection involves intravenous fluid and vasopressor dosage within a 4-hour window, mapped to a 5×5 medical intervention space, following Komorowski et al. (2018). Figure 2 shows the distribution of sepsis treatment co-occurrence in the two cohorts. |
| Dataset Splits | Yes | The two datasets were randomly split into training, calibration (validation), and test sets in an 80%/10%/10% ratio via patient-level splits to ensure no patient overlap, under the assumption that the entire dataset is i.i.d. sampled from a common distribution. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. |
| Software Dependencies | No | Specifically, we use Bio_ClinicalBERT² to encode clinical notes, modeled as X_W (Alsentzer et al., 2019), which provides superior performance in encoding clinical text due to its bidirectional attention mechanism and domain-specific pretraining on large-scale biomedical and clinical corpora (Huang et al., 2023; Hu et al., 2024; Zhang et al., 2022). Footnote 2: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT. While Bio_ClinicalBERT is mentioned, no specific version number for this or any other software dependency is provided. |
| Experiment Setup | Yes | Appendix C.3 provides a sensitivity analysis of several hyperparameters, including the length of historical information, hidden dimension, and γ in the loss function. Historical sequence length L: ... we set the sequence length to 8 for all experiments. Hidden dimensionality hd: ... we choose 128 to reduce model parameters and improve computational efficiency. γ in the Loss Function: Figure 10 illustrates the performance of SAFER under different γ values, guiding the selection of an optimal γ. |
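The 5×5 medical intervention space quoted above (IV fluid × vasopressor dose within a 4-hour window, following Komorowski et al., 2018) can be sketched as a quartile-based binning: bin 0 for zero dose and bins 1–4 split at the quartiles of the nonzero doses. The function names and the bin edges below are illustrative assumptions, not taken from the paper or its code.

```python
def dose_to_bin(dose, nonzero_quartiles):
    """Map a dose to one of 5 bins: bin 0 for a zero dose, bins 1-4 by
    quartile edges of the nonzero doses (Komorowski et al., 2018 style).
    `nonzero_quartiles` is the list of 3 internal quartile edges."""
    if dose == 0:
        return 0
    for i, edge in enumerate(nonzero_quartiles, start=1):
        if dose <= edge:
            return i
    return 4  # above the 75th percentile of nonzero doses

def action_index(iv_dose, vaso_dose, iv_quartiles, vaso_quartiles):
    """Combine the two 5-level dose bins into one index in the 5x5 space."""
    return dose_to_bin(iv_dose, iv_quartiles) * 5 + dose_to_bin(vaso_dose, vaso_quartiles)
```

With this encoding, action 0 corresponds to "no IV fluid, no vasopressor" and action 24 to the highest bin of both, giving the 25 discrete treatments the paper's recommendation metrics are computed over.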
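The 80%/10%/10% patient-level split described under "Dataset Splits" can be sketched as follows. This is a minimal illustration of splitting by patient ID so no patient appears in more than one set; the record format and function name are assumptions, and the SAFER repository contains the authors' actual split code.

```python
import random

def patient_level_split(records, train_frac=0.8, cal_frac=0.1, seed=0):
    """Split (patient_id, data) records into train/calibration/test sets
    at the patient level, so no patient spans two sets. Fractions apply
    to patients, not records (a simplifying assumption)."""
    ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_cal = int(len(ids) * cal_frac)
    groups = {
        "train": set(ids[:n_train]),
        "cal": set(ids[n_train:n_train + n_cal]),
        "test": set(ids[n_train + n_cal:]),
    }
    return {name: [r for r in records if r[0] in pids]
            for name, pids in groups.items()}
```

Splitting on patient IDs rather than individual records is what enforces the "no patient overlap" condition the paper states; a record-level shuffle would leak a patient's trajectory across sets.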