Confidence Intervals and Simultaneous Confidence Bands Based on Deep Learning

Authors: Asaf Ben Arie, Malka Gorfine

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The validity of the proposed approach is demonstrated through an extensive simulation study, which shows that the method is accurate (i.e., valid and not overly conservative) as long as the network is sufficiently deep to ensure that the estimators provided by the deep neural network exhibit minimal bias. The utility of the proposed approach is demonstrated through two applications: constructing simultaneous confidence bands for survival curves generated by deep neural networks dealing with right-censored survival data, and constructing a confidence interval for classification probabilities in the context of binary classification regression.
Researcher Affiliation | Academia | Asaf Ben Arie (EMAIL), Department of Statistics and Operations Research, Tel Aviv University, Israel; Malka Gorfine (EMAIL), Department of Statistics and Operations Research, Tel Aviv University, Israel
Pseudocode | Yes | The following is the complete algorithm for generating a simultaneous confidence band of $S(\cdot|x)$ at level $(1-\alpha)100\%$:
1. Generate an ensemble estimator, for example, $\hat{S}^M_n(s|x) = M^{-1}\sum_{m=1}^{M} \hat{S}_{n,m}(s|x,\hat{\theta}_{n,m})$, $s \in [0,\tau]$.
2. For any bootstrap sample $b$, $b = 1,\ldots,B$, get $\hat{S}^{(b)}_n(s|x;\hat{\theta}^{(b)}_n) = \hat{\mu}^{(b)}_n(x,\hat{\theta}^{(b)}_n)$, $s \in [0,\tau]$, and $d^{(b)}(x) = \max_{s \in [0,\tau]} \left|\hat{S}^{(b)}_n(s|x;\hat{\theta}^{(b)}_n) - \hat{S}^M_n(s|x)\right|$.
3. Get the $1-\alpha$ percentile of $d^{(1)}(x),\ldots,d^{(B)}(x)$, denoted by $d^{\mathrm{boots}}_{1-\alpha}(x)$.
4. For any $s \in [0,\tau]$, define $L^{KS}(s|x) = \max\left\{\hat{S}_n(s|x;\hat{\theta}_n) - d^{\mathrm{boots}}_{1-\alpha}(x),\, 0\right\}$ and $U^{KS}(s|x) = \min\left\{\hat{S}_n(s|x;\hat{\theta}_n) + d^{\mathrm{boots}}_{1-\alpha}(x),\, 1\right\}$.
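The four steps of the quoted algorithm can be sketched numerically as follows. This is a minimal NumPy sketch, not the authors' implementation (their code is in the linked repository); it assumes the bootstrap curves, the ensemble curve, and the point estimate have already been evaluated on a common time grid, and the function name `simultaneous_band` is ours.

```python
import numpy as np

def simultaneous_band(boot_curves, ensemble_curve, point_curve, alpha=0.05):
    """Kolmogorov-Smirnov-style simultaneous band from bootstrap survival curves.

    boot_curves    : (B, T) array, bootstrap curves S^(b)(s|x) on T time points
    ensemble_curve : (T,) ensemble estimator S^M(s|x)         (step 1, given)
    point_curve    : (T,) point estimator S(s|x), band centre
    """
    # Step 2: sup-distance of each bootstrap curve from the ensemble curve.
    d = np.max(np.abs(boot_curves - ensemble_curve), axis=1)
    # Step 3: (1 - alpha) percentile of the bootstrap distances.
    d_alpha = np.quantile(d, 1.0 - alpha)
    # Step 4: clip the band to [0, 1], since S is a survival function.
    lower = np.clip(point_curve - d_alpha, 0.0, 1.0)
    upper = np.clip(point_curve + d_alpha, 0.0, 1.0)
    return lower, upper
```

The constant half-width `d_alpha` is what makes the band simultaneous over $[0,\tau]$, as opposed to pointwise intervals whose width varies with $s$.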
Open Source Code | Yes | Code for the data analysis and the reported simulations is available on GitHub: https://github.com/Asafba123/Survival_bootstrap.
Open Datasets | Yes | In this section, we analyze four commonly used survival datasets to demonstrate the utility of our proposed approach. These datasets were introduced and used by Katzman et al. (2016) and Kvamme et al. (2019), among others, and are available through PyCox. Details of the datasets: SUPPORT: Study to Understand Prognoses Preferences Outcomes and Risks of Treatment; METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; Rot. & GBSG: Rotterdam tumor bank and German Breast Cancer Study Group; FLCHAIN: Assay of Serum Free Light Chain.
Dataset Splits | Yes | The studied sample sizes of the training plus validation data range from n = 1,000 to 10,000, with an 80%-20% split between training and validation. ... We conducted the simulations with sample sizes of n = 1,000, 2,000, and 5,000, splitting the data into 80% for training and 20% for validation. ... The held-out fold serves as the test set, while the remaining data were split 80%-20% into training and validation sets.
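The recurring 80%-20% split can be sketched as below. The helper name `train_val_split` is ours, purely for illustration; the paper's actual splitting code lives in its repository.

```python
import numpy as np

def train_val_split(n, val_frac=0.2, seed=0):
    """Shuffle the n sample indices and split them into training and
    validation index arrays (e.g. 80%-20% for val_frac=0.2)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(round(n * val_frac))
    # Validation takes the first n_val shuffled indices, training the rest.
    return idx[n_val:], idx[:n_val]
```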
Hardware Specification | No | The paper mentions GPUs and TPUs in a general context but does not specify any particular hardware used for running the experiments.
Software Dependencies | No | The analysis was conducted using the coxtime package in Python with Kaiming initialization; the learning rate was dynamically determined using the lrfinder method, the Adam optimizer was used for optimization, and the implementation was done in Keras. The paper mentions software packages such as the coxtime package, the Adam optimizer, and Keras, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | We employed a dropout rate of 0.1 and a batch size of 1000. The learning rate was dynamically determined using the lrfinder method, and the Adam optimizer was used for optimization. The networks were standard multilayer DNNs with ReLU activation and batch normalization between layers. ... The training was conducted for 1500 epochs. ... Hyperparameter tuning was conducted using cross-validation. A grid search over the hyperparameter search space, as detailed in Table 1, was performed by splitting the data into 10 folds for each configuration and scoring the C-index on the held-out set. The set of hyperparameters with the highest average C-index was selected. The following hyperparameters were selected for each dataset: SUPPORT: 4 hidden layers, layer width 256, dropout 0.3, and batch size 256; METABRIC: 2 hidden layers, layer width 256, dropout 0.3, and batch size 1024; Rot. & GBSG: 1 hidden layer, layer width 128, dropout 0.1, and batch size 256; and FLCHAIN: 1 hidden layer, layer width 256, dropout 0.1, and batch size 256.
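The grid search described above can be sketched as follows. `SEARCH_SPACE` only mirrors the style of the paper's Table 1 (the values shown are the ones appearing among the selected configurations, not the paper's full grid), and `score_fn` is a placeholder for training the DNN on a configuration and computing the C-index on a held-out fold.

```python
from itertools import product

# Illustrative search space; the paper's actual grid is in its Table 1.
SEARCH_SPACE = {
    "n_layers": [1, 2, 4],
    "width": [128, 256],
    "dropout": [0.1, 0.3],
    "batch_size": [256, 1024],
}

def grid_search(score_fn, n_folds=10):
    """Return the configuration with the highest average held-out C-index.

    score_fn(config, fold) -> C-index of `config` on held-out `fold`;
    a placeholder for fitting the network and evaluating it.
    """
    best_config, best_score = None, float("-inf")
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        config = dict(zip(keys, values))
        # Average the C-index over the n_folds held-out folds.
        avg = sum(score_fn(config, f) for f in range(n_folds)) / n_folds
        if avg > best_score:
            best_config, best_score = config, avg
    return best_config, best_score
```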