Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping
Authors: Yichi Zhang, Molei Liu, Matey Neykov, Tianxi Cai
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital. [...] We conducted extensive simulation studies to examine the finite-sample performance of the PASS estimator and to compare it with existing approaches. [...] We examine the performance of PASS along with other approaches in three real world EHR phenotyping studies with the goal of developing classification models for the diseases of interest. |
| Researcher Affiliation | Academia | Yichi Zhang, Department of Computer Science and Statistics, University of Rhode Island; Molei Liu, Department of Biostatistics, Harvard T.H. Chan School of Public Health; Matey Neykov, Department of Statistics and Data Science, Carnegie Mellon University; Tianxi Cai, Department of Biostatistics, Harvard T.H. Chan School of Public Health |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations but does not include any distinct pseudocode blocks or algorithms. |
| Open Source Code | Yes | R codes for implementing PASS and the benchmark methods, and replicating the simulation results can be found at https://github.com/moleibobliu/PASS. |
| Open Datasets | Yes | This de-identified dataset has been analyzed in previous studies (Zhang et al., 2019, e.g.) and is publicly available online: https://celehs.github.io/PheCAP/articles/example2.html. |
| Dataset Splits | Yes | For each choice of ϑ̃_W, we consider the area under the receiver operating characteristic curve (AUC) for classifying Y, the excess risk (ER) as defined in Section 3, and the mean squared error of the predicted probabilities (MSE-P), which is the mean squared difference between the predicted probability and the true probability. We summarize results based on 1000 simulated datasets for each configuration. [...] First, we randomly split the labelled samples into four folds of equal sizes. Then we pick each fold as the validation set, sample n training labels from the other three folds for 20 times, train and validate the algorithms, and finally average the evaluation metrics and their standard errors over the validation results on the four folds. We replicate this procedure 10 times and report the average performance. |
| Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running the experiments. |
| Software Dependencies | No | In this paper, we use the R package glmnet (Friedman et al., 2010) to compute ζ̂, γ̂, ρ̂, and δ̂, and construct the final estimator for ϑ₀ as ϑ̂ = (ζ̂, γ̂, β̂) with β̂ = δ̂ + ρ̂α̂. The version number for the R package 'glmnet', or for R itself, is not specified. |
| Experiment Setup | Yes | Throughout, we let N = 10000 and let ν = 1 in the ALASSO weights. We use the Bayesian information criterion (BIC) to select µ_init and µ in the estimation of α due to large N, and use 10-fold cross-validation to select λ1, λ2 for the estimation of β, so that the phenotype model is tuned towards prediction performance. [...] For the size of training labels, we consider n = 50, 70, 90. [...] For the size of training labels, we consider n = 50, 125, 200. [...] For the size of training labels, we consider n = 50, 85, 120. |
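The labelled-data evaluation protocol quoted in the Dataset Splits row (four equal folds, each held out in turn; n training labels resampled 20 times from the remaining folds; the whole procedure replicated 10 times) is concrete enough to sketch. Below is a minimal illustration in Python rather than the paper's R; `fit` and `score` are hypothetical placeholders for a classifier and an evaluation metric such as AUC, not functions from the released code.

```python
import numpy as np

def evaluate_protocol(X, y, n_train, fit, score,
                      n_folds=4, n_resamples=20, n_replicates=10, seed=0):
    """Sketch of the paper's labelled-data evaluation protocol:
    split the labelled samples into n_folds equal folds, hold out each
    fold as the validation set, resample n_train training labels from
    the other folds n_resamples times, and average the metric over all
    runs; the whole procedure is replicated n_replicates times."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_replicates):
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, n_folds)
        for k in range(n_folds):
            val = folds[k]
            pool = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            for _ in range(n_resamples):
                train = rng.choice(pool, size=n_train, replace=False)
                model = fit(X[train], y[train])
                scores.append(score(model, X[val], y[val]))
    scores = np.asarray(scores)
    # Average performance and its standard error across all runs.
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
```

Any classifier can be plugged in through `fit`/`score`; with the paper's smallest setting, `n_train=50` is drawn from the three non-validation folds on each resample.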
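The Experiment Setup row notes that the penalty parameters µ_init and µ are chosen by BIC because N is large. The mechanics of BIC-based penalty selection can be illustrated with a stand-in: the paper fits an adaptive lasso via the R package glmnet, whereas the sketch below uses scikit-learn's `LassoLarsIC` (a plain lasso tuned by BIC) on synthetic data; the data-generating values here are illustrative, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

# Synthetic stand-in for a large sample (N = 10000, matching the paper's N).
rng = np.random.default_rng(1)
N, p = 10000, 20
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.25]  # three true signals, rest are noise
y = X @ beta_true + rng.standard_normal(N)

# Select the lasso penalty by minimizing BIC along the LARS path,
# mirroring the paper's BIC-based choice of tuning parameter at large N.
model = LassoLarsIC(criterion="bic").fit(X, y)
print("selected penalty:", model.alpha_)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```

At this sample size the BIC penalty log(N) is easily outweighed by the likelihood gain from the true signals, so the three nonzero coefficients are retained while most noise variables are dropped.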