X-CAL: Explicit Calibration for Survival Analysis

Authors: Mark Goldstein, Xintian Han, Aahlad Puli, Adler Perotte, Rajesh Ranganath

NeurIPS 2020

Reproducibility variables (variable: result — supporting excerpt from the paper):
Research Type: Experimental — "In our experiments, we fit a variety of shallow and deep models on simulated data, a survival dataset based on MNIST, on length-of-stay prediction using MIMIC-III data, and on brain cancer data from The Cancer Genome Atlas. We show that the models we study can be miscalibrated. We give experimental evidence on these datasets that X-CAL improves D-CALIBRATION without a large decrease in concordance or likelihood."
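The D-calibration criterion referenced in this excerpt checks whether the model's predicted CDF values at the observed event times are uniformly distributed on [0, 1]. A minimal sketch for uncensored data only (the function name and the squared-deviation aggregation are illustrative choices, not the paper's exact estimator, and the paper additionally handles censored points):

```python
import numpy as np

def d_calibration(cdf_values, n_bins=20):
    """Hard D-calibration statistic for uncensored data: if the model is
    calibrated, F(t_i | x_i) is Uniform[0, 1], so each of the n_bins
    equal-width bins should hold about 1/n_bins of the points.
    Returns the sum of squared deviations from the uniform proportion."""
    cdf_values = np.asarray(cdf_values)
    # Count how many predicted CDF values fall in each bin of [0, 1].
    counts, _ = np.histogram(cdf_values, bins=n_bins, range=(0.0, 1.0))
    proportions = counts / len(cdf_values)
    return float(np.sum((proportions - 1.0 / n_bins) ** 2))

# A calibrated model yields uniform CDF values (small statistic);
# a miscalibrated one concentrates them (large statistic).
rng = np.random.default_rng(0)
uniform_stat = d_calibration(rng.uniform(size=100_000))
skewed_stat = d_calibration(rng.beta(2, 5, size=100_000))
```

The hard bin counts make this statistic a good evaluation metric but a poor training objective, since the indicator functions have zero gradient almost everywhere.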
Researcher Affiliation: Academia — Mark Goldstein (New York University), Xintian Han (New York University), Aahlad Puli (New York University), Adler J. Perotte (Columbia University), Rajesh Ranganath (New York University)
Pseudocode: No — The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code: Yes — Code is available at https://github.com/rajesh-lab/X-CAL
Open Datasets: Yes — "We simulate a survival dataset conditionally on the MNIST dataset [LeCun et al., 2010]"; "[we] predict the length of stay (in number of hours) in the ICU, using data from the MIMIC-III dataset [Johnson et al., 2016]"; "the glioma (a type of brain cancer) dataset collected as part of the TCGA program (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/studied-cancers/glioma) and studied in [Network, 2015]"
Dataset Splits: Yes — "We sample train/validation/test sets with 100k/50k/50k datapoints, respectively"; "We use PyTorch's MNIST with test split into validation/test"; "There are 2,925,434 and 525,912 instances in the training and test sets. We split the training set in half for train and validation"; "The train/validation/test sets are made of 552/276/277 datapoints respectively"
Hardware Specification: No — The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies: No — The paper mentions "PyTorch's MNIST" and the "Lifelines package" but does not specify version numbers for these software components.
Experiment Setup: No — The paper states "We use γ = 10000", "We use 20 D-CALIBRATION bins disjoint over [0, 1] for all experiments except for the cancer data, where we use 10 bins as in Haider et al. [2020]", and "All reported results are an average of three seeds", but does not include other specific hyperparameters such as learning rates, batch sizes, or optimizer details.
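The γ = 10000 quoted above is the temperature of X-CAL's differentiable relaxation of D-calibration: hard bin-membership indicators are replaced by soft ones so the calibration penalty can be added to a training loss. A minimal sketch, assuming sigmoid-based soft bin membership (the paper's exact soft-indicator construction may differ; `soft_d_calibration` is an illustrative name):

```python
import numpy as np

def sigmoid(x):
    # Clip to avoid overflow in exp for the very steep gamma used here.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -50.0, 50.0)))

def soft_d_calibration(cdf_values, n_bins=20, gamma=10000.0):
    """Differentiable relaxation of D-calibration: the hard indicator
    1[a <= u < b] for bin [a, b) is approximated by
    sigmoid(gamma*(u - a)) - sigmoid(gamma*(u - b)).
    As gamma -> infinity this recovers hard bin counting."""
    u = np.asarray(cdf_values)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    lo, hi = edges[:-1], edges[1:]
    # soft[j, k]: soft membership of point j in bin k.
    soft = (sigmoid(gamma * (u[:, None] - lo[None, :]))
            - sigmoid(gamma * (u[:, None] - hi[None, :])))
    proportions = soft.mean(axis=0)
    return float(np.sum((proportions - 1.0 / n_bins) ** 2))

# Uniform CDF values (calibrated) vs. concentrated ones (miscalibrated).
rng = np.random.default_rng(0)
soft_uniform = soft_d_calibration(rng.uniform(size=50_000))
soft_skewed = soft_d_calibration(rng.beta(2, 5, size=50_000))
```

With a large temperature such as the quoted γ = 10000, the sigmoids are nearly step functions, so the soft statistic closely tracks the hard one while remaining differentiable in the model's predicted CDF values.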