Conformalized Survival Analysis for General Right-Censored Data
Authors: Hen Davidov, Shai Feldman, Gil Shamai, Ron Kimmel, Yaniv Romano
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the informativeness and validity of our methods in simulated settings and showcase their practical utility using several real-world datasets. |
| Researcher Affiliation | Academia | Hen Davidov, Shai Feldman, Gil Shamai, Ron Kimmel, Yaniv Romano |
| Pseudocode | Yes | A formal description of the algorithm for the naive calibration method is given by Algorithm 1. ... A formal description of the focused calibration algorithm is given in Algorithm 2. ... A formal description of the fused calibration method is presented in Algorithm 3. |
| Open Source Code | Yes | A Python implementation of our methods is provided in our GitHub repository. |
| Open Datasets | Yes | We demonstrate the practical utility of our methods by applying them to six real-world datasets: The Northern Alberta Cancer Dataset (NACD) (Haider et al., 2020), Rotterdam & German Breast Cancer Study Group (GBSG), Molecular Taxonomy of Breast Cancer International Consortium (METABRIC), Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT) (Kvamme et al., 2019; Katzman et al., 2018), a user churn dataset (Fotso et al., 2019–present), as well as The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) multimodal dataset collection (Tomczak et al., 2015). ... The synthetic data generation function and the processed TCGA-BRCA dataset are available in the GitHub repository. |
| Dataset Splits | Yes | In all experiments, the dataset was split into four parts: 60% for training, 20% for calibration, 10% for validation (used for early stopping), and 10% for testing to evaluate performance. ... The reported performance metrics are evaluated on 50 independent trials, each consisting of newly sampled train, validation, calibration, and test sets of sizes 600, 200, 1000, and 200, respectively. |
| Hardware Specification | Yes | CPU: AMD EPYC 7443 24-Core Processor GPU: NVIDIA RTX A6000 OS: Ubuntu 20.04 |
| Software Dependencies | No | The paper mentions software such as the 'DeepSurv method (Katzman et al., 2018)', the 'pycox package (Kvamme et al., 2019), implemented using a PyTorch MLP regressor', and 'scikit-learn (Pedregosa et al., 2011) to train Random Forest classifiers'. However, specific version numbers for PyTorch or scikit-learn are not provided. |
| Experiment Setup | Yes | In all experiments, we approximate the distribution of T \| X using the DeepSurv method (Katzman et al., 2018), as implemented in the pycox package (Kvamme et al., 2019), using a PyTorch MLP regressor with ReLU activations, early stopping (triggered after 5 epochs without improvement), and a training cycle of 1000 epochs. The model was optimized with Adam (lr = 1e-3, β1 = 0.9, β2 = 0.999), a batch size of 256, dropout layers with rate p = 0.1, batch normalization layers, and varying configurations of hidden layers, detailed in Table 4. These configurations were selected to be similar to those found in the PyCox notebooks, with the real-world datasets getting a deeper model to account for their more complex and interconnected nature. Additionally, we employed scikit-learn (Pedregosa et al., 2011) to train Random Forest classifiers with max depths of 4 and 2 to estimate the weights ŵτ and the indicator ŝτ, respectively. |
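The 60%/20%/10%/10% split protocol reported above can be sketched as follows; the function name, seed, and dataset size are illustrative, not from the paper:

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle n indices and split them 60/20/10/10 into
    train / calibration / validation / test, as in the reported setup."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.6 * n)
    n_cal = int(0.2 * n)
    n_val = int(0.1 * n)
    train = idx[:n_train]
    cal = idx[n_train:n_train + n_cal]
    val = idx[n_train + n_cal:n_train + n_cal + n_val]
    test = idx[n_train + n_cal + n_val:]
    return train, cal, val, test

# Example: a dataset of 1000 samples yields 600/200/100/100 indices.
train, cal, val, test = split_indices(1000)
```

Note that this is the percentage-based split; the synthetic experiments instead draw fixed-size sets (600/200/1000/200 for train/validation/calibration/test) in each of the 50 trials.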
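The two Random Forest classifiers described in the setup (max depths 4 and 2, estimating the weights ŵτ and the indicator ŝτ) could be instantiated as below. This is a minimal sketch: the toy covariates, labels, and feature dimension are placeholders, not the paper's actual estimation targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                           # placeholder covariates
y_w = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # placeholder label for the weight model
y_s = (X[:, 1] > 0).astype(int)                          # placeholder label for the indicator model

# Depth-4 forest for the weights \hat{w}_tau, depth-2 forest for the
# indicator \hat{s}_tau, matching the depths reported in the setup.
w_model = RandomForestClassifier(max_depth=4, random_state=0).fit(X, y_w)
s_model = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y_s)

w_hat = w_model.predict_proba(X)[:, 1]  # probability estimates used as weights
s_hat = s_model.predict(X)              # binary indicator estimates
```

Using `predict_proba` for the weights and `predict` for the indicator reflects that ŵτ is a probability-like quantity while ŝτ is a 0/1 decision; the paper's exact targets for each classifier are defined by its calibration algorithms.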