From Predictive Methods to Missing Data Imputation: An Optimization Approach
Authors: Dimitris Bertsimas, Colin Pawlowski, Ying Daisy Zhuo
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the methods in computational experiments using 84 real-world data sets taken from the UCI Machine Learning Repository. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. |
| Researcher Affiliation | Academia | Dimitris Bertsimas EMAIL Colin Pawlowski EMAIL Ying Daisy Zhuo EMAIL Sloan School of Management and Operations Research Center Massachusetts Institute of Technology Cambridge, MA 02139 |
| Pseudocode | Yes | Algorithm 1, which we refer to as opt.impute, implements BCD or CD for Problem (1). |
| Open Source Code | No | The paper mentions implementation details like: "The implementation was in the programming language Julia with fast algorithms for K-nearest neighbor calculations." "The implementation was in Julia using the scikit-learn package in Python." "The implementation was in Julia using the packages glmnet and Optimal Trees." However, it does not explicitly state that the authors' own code for the methodology is open-source, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate the methods in computational experiments using 84 real-world data sets taken from the UCI Machine Learning Repository. |
| Dataset Splits | Yes | First, we divide each downstream data set using a 50% training/testing split. Next, we randomly sample a fixed percentage of the entries in X to be missing completely at random, ranging from 10% to 50%. |
| Hardware Specification | Yes | Each method was run on a single thread of a machine with an Intel Xeon CPU E5-2650 (2.00 GHz) Processor and limited to 8 GB RAM with a time limit of 4 hours. |
| Software Dependencies | No | The paper mentions several software packages like "MICE package in R", "pcaMethods package in R", "scikit-learn package in Python", "glmnet and Optimal Trees" and the programming language "Julia" but does not provide specific version numbers for these components used in the authors' implementation or for the benchmark methods. |
| Experiment Setup | Yes | The hyperparameters that we tune via this method are summarized in Table 3. In addition, we also use this cross-validation procedure to select the best method out of opt.knn, opt.svm, and opt.tree. |
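As a concrete illustration of the experimental protocol quoted above (masking a fixed percentage of entries completely at random, then imputing), here is a minimal Python sketch. The function names, the mean-imputation warm start, and the simple iterative K-nearest-neighbor refinement are our own illustrative assumptions; this is not the paper's opt.impute implementation, which the authors describe as written in Julia.

```python
import numpy as np

def mcar_mask(X, frac_missing, rng):
    """Set a fixed fraction of entries to NaN, missing completely at random."""
    X = X.astype(float).copy()
    n_miss = int(round(frac_missing * X.size))
    idx = rng.choice(X.size, size=n_miss, replace=False)
    X.flat[idx] = np.nan
    return X

def mean_impute(X):
    """Warm start: fill each column's missing entries with that column's mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X_missing, k=3, n_iters=5):
    """Iteratively re-impute missing entries from each row's K nearest
    neighbors (distances computed on the current imputation), a simple
    alternating refinement of the mean-imputation warm start."""
    mask = np.isnan(X_missing)
    X = mean_impute(X_missing)
    for _ in range(n_iters):
        # Pairwise Euclidean distances on the current imputed matrix.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)  # a row is never its own neighbor
        neighbors = np.argsort(d, axis=1)[:, :k]
        X_new = X.copy()
        for i in range(X.shape[0]):
            miss = np.where(mask[i])[0]
            if miss.size:
                # Replace missing coordinates with the neighbors' mean.
                X_new[i, miss] = X[neighbors[i]][:, miss].mean(axis=0)
        X = X_new
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_full = rng.normal(size=(20, 4))
    X_miss = mcar_mask(X_full, frac_missing=0.3, rng=rng)
    X_imp = knn_impute(X_miss, k=3)
    print("missing fraction:", np.isnan(X_miss).mean())
    print("any NaN left after imputation:", np.isnan(X_imp).any())
```

Observed entries are never overwritten; only the masked coordinates are refined on each pass, mirroring the structure of block-coordinate updates over the missing values.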