DataWig: Missing Value Imputation for Tables
Authors: Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David Salinas
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that DataWig compares favourably to existing imputation packages. We observe that DataWig compares favourably with other implementations for numeric imputation, even in the difficult missing-not-at-random condition. When comparing DataWig with mode imputation and string matching (Dallachiesa et al., 2013), DataWig achieves a median F1-score of 60% across three tasks (imputation of the Wikipedia attributes birth-place, genre and location) with a simple n-gram model. |
| Researcher Affiliation | Collaboration | Felix Bießmann EMAIL, Beuth University, Luxemburger Str. 10, 13353 Berlin (work done while at Amazon Research); Tammo Rukat EMAIL, Phillipp Schmidt EMAIL, Amazon Research, Krausenstr. 38, 10117 Berlin, Germany; Prathik Naidu EMAIL, Department of Computer Science, Stanford University, Stanford, CA 94305, USA (work done while at Amazon Research); Sebastian Schelter EMAIL, Center for Data Science, New York University, New York, USA (work done while at Amazon Research); Andrey Taptunov EMAIL, Snowflake, Stresemannstraße 123, 10963 Berlin (work done while at Amazon Research); Dustin Lange EMAIL, Amazon Research, Krausenstr. 38, 10117 Berlin, Germany; David Salinas EMAIL, Naver Labs, 6 Chemin de Maupertuis, 38240 Meylan, France (work done while at Amazon Research) |
| Pseudocode | Yes | Figure 1: Left: Available featurizers and loss functions for different data types in DataWig. Right: Application example of the DataWig API for the use case shown in Figure 2. `table = pandas.read_csv('products.csv')`; `missing = table[table['color'].isnull()]`; instantiate model and train imputer: `model = SimpleImputer(input_columns=['description', 'product_type', 'size'], output_columns=['color']).fit(table)`; impute missing values: `imputed = model.predict(missing)` |
| Open Source Code | Yes | Source code, documentation, and unit tests for this package are available at: github.com/awslabs/datawig. The software, unit tests, and all experiments are available under github.com/awslabs/datawig. |
| Open Datasets | Yes | All methods were evaluated on one synthetic linear and one synthetic non-linear problem and five real data sets available in sklearn. When comparing DataWig with mode imputation and string matching (Dallachiesa et al., 2013), DataWig achieves a median F1-score of 60% across three tasks (imputation of the Wikipedia attributes birth-place, genre and location) with a simple n-gram model. |
| Dataset Splits | No | Test errors were obtained on a separate test set; for details and unnormalized results see the benchmarks GitHub repository. For each baseline method, grid search was performed for hyperparameter optimization on a validation set. |
| Hardware Specification | No | The paper mentions "efficient execution on both CPUs and GPUs" but does not provide specific models or configurations of the hardware used for experiments. |
| Software Dependencies | No | The paper mentions "Apache mxnet", "pandas dataframe" and "sklearn" but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | No | All hyperparameters and neural architectures are optimized using random search (Bergstra and Bengio, 2012), which can be constrained to a specified time limit. For DataWig, the `SimpleImputer.complete` function with random search for hyperparameter tuning was used. For each baseline method, grid search was performed for hyperparameter optimization on a validation set; test errors were obtained on a separate test set (for details and unnormalized results see the benchmarks GitHub repository). Iterative imputation here means that 10 consecutive imputation rounds were performed for replacing the missing values in the input columns. |
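The mode-imputation baseline that DataWig is compared against above can be sketched in a few lines of plain pandas. This is a minimal illustration on hypothetical toy data, not the paper's benchmark code:

```python
import pandas as pd

# Hypothetical toy table with missing categorical values.
table = pd.DataFrame({
    "color": ["red", "blue", None, "red", None],
    "size":  ["S", "M", "L", "S", "M"],
})

# Mode imputation: fill every missing entry in a column with that
# column's most frequent observed value ("red" here).
mode_value = table["color"].mode().iloc[0]
imputed = table.assign(color=table["color"].fillna(mode_value))
```

Unlike DataWig's `SimpleImputer`, which learns to predict the missing value from the other columns, mode imputation ignores the rest of the row entirely, which is why it serves as the weakest reference point in the paper's comparison.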