Evaluation of Active Feature Acquisition Methods for Time-varying Feature Settings
Authors: Henrik von Kleist, Alireza Zamanian, Ilya Shpitser, Narges Ahmidi
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 8, we present synthetic data experiments that exemplify the improved data efficiency and reduced positivity requirements of the semi-offline RL estimators. Our experiments also show that biased evaluation methods commonly used in the AFA literature can lead to detrimental conclusions regarding the performance of AFA agents. Deploying such methods without caution may pose significant risks to patients' lives. We end the paper with a Discussion (Section 9) and Conclusion (Section 10). |
| Researcher Affiliation | Collaboration | Henrik von Kleist (1,2,3), Alireza Zamanian (2,4), Ilya Shpitser (3), Narges Ahmidi (1,3,4). (1) Institute of AI for Health, Helmholtz Munich, German Research Center for Environmental Health, 85764 Neuherberg, Germany; (2) TUM School of Computation, Information and Technology, Technical University of Munich, 85748 Garching, Germany; (3) Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA; (4) Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany |
| Pseudocode | No | The paper describes methods primarily through mathematical formulations and textual explanations of concepts and algorithms, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate the different estimators on synthetic data sets where the missingness is artificially induced to allow the comparison with the ground truth. For the experiments, we defined a superfeature as a feature that comprises multiple subfeatures, which are acquired jointly and which have a single cost. Furthermore, we assumed a subset of features is available at no cost (free features) and set fixed acquisition costs c_acq for the remaining features. A prediction was to be performed at each time step, which corresponds to the setting described in Appendix K. We chose misclassification costs such that good policies must find a balance between the feature acquisition cost and the predictive value of the features. We evaluated and compared the described methods on synthetic data sets with and without violation of either the NDE or NUC assumption. In experiments where the NDE assumption holds, the features are distributed according to: X^t_{(1),i} = γ_i X^{t−1}_{(1),i} + (1 − γ_i)ϵ_i if t > 0, and X^t_{(1),i} = ϵ_i if t = 0, where ϵ_i ~ N(0, σ). In experiments with a violation of the NDE assumption, the unobserved variables U were distributed according to: U^t_i = γ_i U^{t−1}_i + (1 − γ_i)ϵ_i + 0.5 Σ_i A^{t−1}_i if t > 1; U^t_i = γ_i U^{t−1}_i + (1 − γ_i)ϵ_i if t = 1; and U^t_i = ϵ_i if t = 0. The labels are distributed according to: p(Y^t = 1) = 1 if ζ_1 Σ_i W_i X^t_{(1),i} + ζ_2 Σ_i W_i X^{t−1}_{(1),i} > 0, and 0.3 otherwise. This choice for Y simulates a scenario where not all data points are equally easy to classify. The retrospective policy π_β follows different logistic models depending on whether a MAR assumption (NUC holds) or MNAR assumption (NUC is violated) is assumed, as specified in Table 3. To evaluate the convergence of different estimators when the NDE assumption holds, we consider the average cost of running the AFA agent on the data set over all data points in the ground truth test set (without missingness) as the true expected cost J. When NDE is violated, we sample the ground truth data generating process while running the agent and do so the same number of times as there are data points in the test set. We performed five different experiments: ... For full experiment configurations for the acquisition processes, please see Table 4. |
| Dataset Splits | Yes | Sample size n = 100,000, divided into a 30% training set (for agent and classifier), a 30% nuisance-function training set, and a 40% test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud resources with specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'logistic regression' models and a 'proximal policy optimization (PPO) RL agent' as methods, but does not specify any software libraries, frameworks (e.g., PyTorch, TensorFlow), or their version numbers. It also refers to 'impute-then-regress classifier (Le Morvan et al., 2021)' but without detailing specific software versions for its implementation. |
| Experiment Setup | Yes | We used an impute-then-regress classifier (Le Morvan et al., 2021) with unconditional mean imputation and a logistic regression classifier for the classification task and trained it on the available and further randomly subsampled data (where p(A^t_i = 1) = 0.5). We tested random and fixed acquisition policies that acquire each costly feature with a 50% or 100% probability. Furthermore, we evaluated a proximal policy optimization (PPO) RL agent (Schulman et al., 2017), which was trained on the semi-offline sampling distribution p using π_α as the semi-offline sampling policy, but without adjustment for the blocking of actions. ... PPO (learning rate: 0.0001, number of layers: 2, hidden layer neurons per layer: 64, hidden layer activation function: tanh). Nuisance functions π̂_β (logistic regression), Q̂_Semi (Ξ = , learning rate: 0.001, number of layers: 2, hidden layer neurons per layer: 16, hidden layer activation function: ReLU). |
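The synthetic data-generating process quoted in the Open Datasets row (an AR(1)-style feature mixture plus a thresholded label rule) can be sketched as follows. This is a minimal illustration of the NDE-holds case only; the dimensions `n`, `T`, `d` and the parameter values for `gamma`, `sigma`, `W`, `zeta1`, `zeta2` are hypothetical placeholders, not values reported by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters (not specified in the review table).
n, T, d = 1000, 4, 5
gamma = rng.uniform(0.5, 0.9, size=d)  # per-feature mixing weights gamma_i
sigma = 1.0                            # noise scale: eps_i ~ N(0, sigma)
W = rng.normal(size=d)                 # label weights W_i (hypothetical)
zeta1, zeta2 = 1.0, 0.5                # label coefficients (hypothetical)

X = np.zeros((n, T, d))
Y = np.zeros((n, T), dtype=int)
for t in range(T):
    eps = rng.normal(0.0, sigma, size=(n, d))
    if t == 0:
        X[:, t] = eps  # X^0_i = eps_i
    else:
        # X^t_i = gamma_i * X^{t-1}_i + (1 - gamma_i) * eps_i  (NDE holds)
        X[:, t] = gamma * X[:, t - 1] + (1 - gamma) * eps
    # p(Y^t = 1) = 1 if zeta1 * sum_i W_i X^t_i + zeta2 * sum_i W_i X^{t-1}_i > 0,
    # and 0.3 otherwise (so some points are deterministically labeled, others noisy).
    prev = X[:, t - 1] if t > 0 else np.zeros((n, d))
    score = zeta1 * (X[:, t] @ W) + zeta2 * (prev @ W)
    Y[:, t] = np.where(score > 0, 1, (rng.random(n) < 0.3).astype(int))
```

The piecewise label rule is what makes evaluation interesting: points with a positive score are trivially classifiable given the right features, while the rest carry irreducible label noise, so an AFA agent must weigh acquisition cost against the predictive value of each feature.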