From Predictive Methods to Missing Data Imputation: An Optimization Approach
Authors: Dimitris Bertsimas, Colin Pawlowski, Ying Daisy Zhuo
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the methods in computational experiments using 84 real-world data sets taken from the UCI Machine Learning Repository. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. |
| Researcher Affiliation | Academia | Dimitris Bertsimas EMAIL Colin Pawlowski EMAIL Ying Daisy Zhuo EMAIL Sloan School of Management and Operations Research Center Massachusetts Institute of Technology Cambridge, MA 02139 |
| Pseudocode | Yes | Algorithm 1, which we refer to as opt.impute, implements BCD or CD for Problem (1). |
| Open Source Code | No | The paper mentions implementation details like: "The implementation was in the programming language Julia with fast algorithms for K-nearest neighbor calculations." "The implementation was in Julia using the scikit-learn package in Python." "The implementation was in Julia using the packages glmnet and Optimal Trees." However, it does not explicitly state that the authors' own code for the methodology is open-source, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate the methods in computational experiments using 84 real-world data sets taken from the UCI Machine Learning Repository. |
| Dataset Splits | Yes | First, we divide each downstream data set using a 50% training/testing split. Next, we randomly sample a fixed percentage of the entries in X to be missing completely at random, ranging from 10% to 50%. |
| Hardware Specification | Yes | Each method was run on a single thread of a machine with an Intel Xeon CPU E5-2650 (2.00 GHz) Processor and limited to 8 GB RAM with a time limit of 4 hours. |
| Software Dependencies | No | The paper mentions several software packages like "MICE package in R", "pcaMethods package in R", "scikit-learn package in Python", "glmnet and Optimal Trees" and the programming language "Julia" but does not provide specific version numbers for these components used in the authors' implementation or for the benchmark methods. |
| Experiment Setup | Yes | The hyperparameters that we tune via this method are summarized in Table 3. In addition, we also use this cross-validation procedure to select the best method out of opt.knn, opt.svm, and opt.tree. |
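As a concrete illustration of the experimental protocol quoted above (masking a fixed percentage of entries completely at random, then imputing), here is a minimal Python sketch. The function names, the mean-imputation warm start, and the simple iterative K-nearest-neighbor refinement are our own illustrative assumptions; this is not the paper's opt.impute implementation, which the authors describe as written in Julia.

```python
import numpy as np

def mcar_mask(X, frac_missing, rng):
    """Set a fixed fraction of entries to NaN, missing completely at random."""
    X = X.astype(float).copy()
    n_miss = int(round(frac_missing * X.size))
    idx = rng.choice(X.size, size=n_miss, replace=False)
    X.flat[idx] = np.nan
    return X

def mean_impute(X):
    """Warm start: fill each column's missing entries with that column's mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X_missing, k=3, n_iters=5):
    """Iteratively re-impute missing entries from each row's K nearest
    neighbors (distances computed on the current imputation), a simple
    alternating refinement of the mean-imputation warm start."""
    mask = np.isnan(X_missing)
    X = mean_impute(X_missing)
    for _ in range(n_iters):
        # Pairwise Euclidean distances on the current imputed matrix.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)  # a row is never its own neighbor
        neighbors = np.argsort(d, axis=1)[:, :k]
        X_new = X.copy()
        for i in range(X.shape[0]):
            miss = np.where(mask[i])[0]
            if miss.size:
                # Replace missing coordinates with the neighbors' mean.
                X_new[i, miss] = X[neighbors[i]][:, miss].mean(axis=0)
        X = X_new
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_full = rng.normal(size=(20, 4))
    X_miss = mcar_mask(X_full, frac_missing=0.3, rng=rng)
    X_imp = knn_impute(X_miss, k=3)
    print("missing fraction:", np.isnan(X_miss).mean())
    print("any NaN left after imputation:", np.isnan(X_imp).any())
```

Observed entries are never overwritten; only the masked coordinates are refined on each pass, mirroring the structure of block-coordinate updates over the missing values.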