Prediction models that learn to avoid missing values
Authors: Lena Stempfle, Anton Matsson, Newton Mwai, Fredrik D. Johansson
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on real-world datasets demonstrate that MA-DT, MA-LASSO, MA-RF, and MA-GBT effectively reduce the reliance on features with missing values while maintaining predictive performance competitive with their unregularized counterparts. This shows that our framework gives practitioners a powerful tool to maintain interpretability in predictions with test-time missing values. 6. Experiments: We demonstrate the MA framework in a suite of experiments, aiming to show that MA models reduce reliance on missing values while preserving predictive performance. We compare MA-DT, MA-LASSO, MA-GBT, and MA-RF to standard models as well as to models designed to handle missing values, using different imputation methods. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-41296 Gothenburg, Sweden. Correspondence to: Lena Stempfle <EMAIL>. |
| Pseudocode | Yes | We implement both an MA random forest (MA-RF) and MA gradient-boosted trees (MA-GBT). Our implementations are based on the corresponding meta-estimators in the scikit-learn library, with MA trees replacing the default tree estimators for both classification and regression tasks. In MA-RF, the individual trees are fit independently, allowing the tree-building process to be parallelized. The parameter σ_{i,j} in the regularized node-splitting criterion in (4) is set to 1 for all i, j. MA-GBT is a gradient-boosting classifier specifically designed to minimize missingness reliance. At its core, it follows the structure of standard gradient boosting, where the model h_m(x) added at step m fits the pseudo-residuals of the ensemble e_{m−1}(x) learned up to that point, using the splitting objective in (4) with a regression criterion (mean squared error) for C. See Algorithm 1 in Appendix B for the MA-GBT pseudocode. |
| Open Source Code | Yes | The code is available at https://github.com/Healthy-AI/malearn. |
| Open Datasets | Yes | Datasets. We study six datasets with varying degrees of missingness. We sample 10,000 samples from the National Health and Nutrition Examination Survey (NHANES) for hypertension prediction with 42 features (Johnson et al., 2013). LIFE (2,864 samples) predicts whether life expectancy is above or below the median using 18 features (World Health Organization (WHO), 2021). ADNI (1,337 samples) predicts diagnosis changes in patients with suspected dementia using 39 features (Weiner et al., 2010). Breast Cancer (1,756 samples) includes 16 features for cancer prediction (Razavi et al., 2018). Pharyngitis (676 samples) is used for pharyngitis prediction using 18 features (Miyagi, 2023). The Explainable Machine Learning Challenge dataset, referred to as FICO, consists of 10,549 samples and is used for credit risk classification with 23 features (FICO et al., 2018). |
| Dataset Splits | Yes | Experimental Setup. We divide each dataset into training and testing subsets using an 80/20 split. For hyperparameter selection, including α, we perform 3-fold cross-validation, evaluating candidate models based on a combination of the area under the receiver operating characteristic curve (AUROC) and empirical missingness reliance (ρ̂). This procedure is repeated for 5 different splits of the dataset. |
| Hardware Specification | No | The computations and data handling were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. |
| Software Dependencies | No | For NeuMiss we used PyTorch (Paszke et al., 2019) in combination with skorch (Tietz et al., 2017). ... The logistic regression model, the decision tree classifier and the random forest classifier were implemented using the scikit-learn library (Pedregosa et al., 2011). For XGBoost, we used the XGBoost Python Package (see https://xgboost.readthedocs.io/en/stable/python/index.html). For MINTY and M-GAM, we adapted code provided by Stempfle & Johansson (2024) and McTavish et al. (2024). |
| Experiment Setup | Yes | Experimental Setup. We divide each dataset into training and testing subsets using an 80/20 split. For hyperparameter selection, including α, we perform 3-fold cross-validation, evaluating candidate models based on a combination of the area under the receiver operating characteristic curve (AUROC) and empirical missingness reliance (ρ̂). Specifically, we select the candidate model with the lowest ρ̂ among those achieving at least 95% of the maximum AUROC. This procedure is repeated for 5 different splits of the dataset, and we report 95% confidence intervals based on the bootstrap distribution of AUROC and ρ̂. ... Appendix C.3 provides details on the experimental setup, including hyperparameters. |
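The regularized node-splitting criterion quoted in the Pseudocode row combines an impurity term C (mean squared error for the regression trees inside MA-GBT) with a penalty on missingness reliance. The sketch below is not the authors' implementation (which is in the linked `malearn` repository); the function name, signature, and the exact form of the penalty are illustrative assumptions based on the description: a candidate split is scored by its weighted child impurity plus α times the fraction of samples whose routing would depend on a missing value.

```python
import numpy as np

def regularized_split_score(y_left, y_right, n_missing_routed, n_total, alpha=1.0):
    """Score a candidate split (lower is better): weighted child MSE (the
    criterion C) plus alpha times the empirical missingness reliance,
    i.e. the fraction of samples routed via a missing feature value.
    Illustrative only -- not the paper's exact objective (4)."""
    def mse(y):
        y = np.asarray(y, dtype=float)
        return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

    n = len(y_left) + len(y_right)
    impurity = (len(y_left) * mse(y_left) + len(y_right) * mse(y_right)) / n
    reliance = n_missing_routed / n_total  # empirical missingness reliance
    return impurity + alpha * reliance
```

With α = 0 this reduces to the standard MSE criterion; larger α makes the tree prefer splits on features that are observed for more samples, which is the trade-off the MA framework tunes via cross-validation.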
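The model-selection rule quoted in the Experiment Setup row (lowest ρ̂ among candidates reaching at least 95% of the maximum AUROC) can be written as a few lines of code. This is a minimal sketch under assumed names; the paper's actual selection code lives in the `malearn` repository and may differ in detail.

```python
def select_model(candidates, tol=0.95):
    """Pick a model per the paper's rule: among candidates whose AUROC is
    at least `tol` times the best AUROC, return the one with the lowest
    empirical missingness reliance rho_hat.

    candidates: list of (name, auroc, rho_hat) tuples (names are hypothetical).
    """
    best_auroc = max(auroc for _, auroc, _ in candidates)
    eligible = [c for c in candidates if c[1] >= tol * best_auroc]
    return min(eligible, key=lambda c: c[2])
```

For example, with candidates A (AUROC 0.90, ρ̂ 0.3), B (0.88, 0.1), and C (0.70, 0.0), the 95% threshold is 0.855, so A and B are eligible and B is selected for its lower reliance.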