Feature Importance Metrics in the Presence of Missing Data

Authors: Henrik Von Kleist, Joshua Wendland, Ilya Shpitser, Carsten Marr

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
Missing data estimation methods cannot be tested on real-world data with real missingness due to the unavailability of ground truth features X(1). To address this limitation, we perform a series of synthetic experiments to illustrate the differences between feature importance metrics, the impact of positivity violations, and the significance of appropriate estimation methods.
Researcher Affiliation: Academia
1 Institute of AI for Health, Helmholtz Munich, German Research Center for Environmental Health, Neuherberg, Germany; 2 TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany; 3 Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; 4 Faculty of Computer Science, Ruhr University Bochum, Bochum, Germany
Pseudocode: No
The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code: No
The paper does not provide any explicit statement about releasing code or a link to a code repository for the methodology described.
Open Datasets: No
Missing data estimation methods cannot be tested on real-world data with real missingness due to the unavailability of ground truth features X(1). To address this limitation, we perform a series of synthetic experiments to illustrate the differences between feature importance metrics, the impact of positivity violations, and the significance of appropriate estimation methods.
Dataset Splits: Yes
We generate 100,000 data points and split them into 30% for training the classifier, 30% for training the measurement policy, and 40% for testing.
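The 30/30/40 split reported above can be sketched as follows. This is an illustrative sketch only: the function name, the index shuffle, and the fixed seed are our assumptions, not the authors' code.

```python
import random

def three_way_split(n_points, seed=0):
    """Shuffle indices and split them 30% / 30% / 40%.

    Sketch of the split described in the paper: 30% for training the
    classifier, 30% for training the measurement policy, 40% for
    testing. The shuffling and seeding details are assumptions.
    """
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    n_clf = int(0.3 * n_points)
    n_pol = int(0.3 * n_points)
    clf_idx = idx[:n_clf]
    pol_idx = idx[n_clf:n_clf + n_pol]
    test_idx = idx[n_clf + n_pol:]
    return clf_idx, pol_idx, test_idx

clf_idx, pol_idx, test_idx = three_way_split(100_000)
print(len(clf_idx), len(pol_idx), len(test_idx))  # 30000 30000 40000
```

For 100,000 generated points this yields 30,000 / 30,000 / 40,000 disjoint index sets.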
Hardware Specification: No
The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies: No
We used an "impute-then-regress" classifier (Le Morvan et al., 2021) with zero imputation and a temporal convolutional network (TCN) (Bai et al., 2018) to classify labels Y_t. This mentions software components (TCN) but does not specify their version numbers.
Experiment Setup: Yes
The classifier uses four layers, with 32 channels per layer, a batch size of 2,000, dropout rate of 0.2, and a learning rate of 0.001. The detailed configurations for each experiment, including the data-generating process parameters (W, γ) and missingness mechanisms (π), are provided in Table 1.
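The reported hyperparameters can be gathered into a configuration sketch. The key names and the validation helper below are our own conventions; only the numeric values come from the paper, and the TCN implementation itself is not shown.

```python
# Hyperparameters reported in the paper, collected into a config dict.
# Key names are illustrative; the paper does not prescribe this layout.
tcn_config = {
    "num_layers": 4,
    "channels_per_layer": 32,
    "batch_size": 2000,
    "dropout": 0.2,
    "learning_rate": 1e-3,
}

def validate_config(cfg):
    """Basic sanity checks on the hyperparameter ranges (our addition)."""
    assert cfg["num_layers"] > 0
    assert cfg["channels_per_layer"] > 0
    assert cfg["batch_size"] > 0
    assert 0.0 <= cfg["dropout"] < 1.0
    assert cfg["learning_rate"] > 0.0
    return cfg

validate_config(tcn_config)
```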