DataWig: Missing Value Imputation for Tables
Authors: Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David Salinas
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that DataWig compares favourably to existing imputation packages. We observe that DataWig compares favourably with other implementations for numeric imputation, even in the difficult missing-not-at-random condition. When comparing DataWig with mode imputation and string matching (Dallachiesa et al., 2013), DataWig achieves a median F1-score of 60% across three tasks (imputation of the Wikipedia attributes birth-place, genre and location) with a simple n-gram model. |
| Researcher Affiliation | Collaboration | Felix Bießmann EMAIL, Beuth University, Luxemburger Str. 10, 13353 Berlin (work done while at Amazon Research); Tammo Rukat EMAIL, Phillipp Schmidt EMAIL, Amazon Research, Krausenstr. 38, 10117 Berlin, Germany; Prathik Naidu EMAIL, Department of Computer Science, Stanford University, Stanford, CA 94305, USA (work done while at Amazon Research); Sebastian Schelter EMAIL, Center for Data Science, New York University, New York, USA (work done while at Amazon Research); Andrey Taptunov EMAIL, Snowflake, Stresemannstraße 123, 10963 Berlin (work done while at Amazon Research); Dustin Lange EMAIL, Amazon Research, Krausenstr. 38, 10117 Berlin, Germany; David Salinas EMAIL, Naver Labs, 6 Chemin de Maupertuis, 38240 Meylan, France (work done while at Amazon Research) |
| Pseudocode | Yes | Figure 1: Left: Available featurizers and loss functions for different data types in DataWig. Right: Application example of the DataWig API for the use case shown in Figure 2. `table = pandas.read_csv('products.csv')`; `missing = table[table['color'].isnull()]`; instantiate model and train imputer: `model = SimpleImputer(input_columns=['description', 'product_type', 'size'], output_columns=['color']).fit(table)`; impute missing values: `imputed = model.predict(missing)` |
| Open Source Code | Yes | Source code, documentation, and unit tests for this package are available at: github.com/awslabs/datawig. The software, unit tests, and all experiments are available under github.com/awslabs/datawig. |
| Open Datasets | Yes | All methods were evaluated on one synthetic linear and one synthetic non-linear problem and five real data sets available in sklearn. When comparing DataWig with mode imputation and string matching (Dallachiesa et al., 2013), DataWig achieves a median F1-score of 60% across three tasks (imputation of the Wikipedia attributes birth-place, genre and location) with a simple n-gram model. |
| Dataset Splits | No | Test errors were obtained on a separate test set; for details and unnormalized results see the benchmarks GitHub repository. For each baseline method, grid search was performed for hyperparameter optimization on a validation set. |
| Hardware Specification | No | The paper mentions "efficient execution on both CPUs and GPUs" but does not provide specific models or configurations of the hardware used for experiments. |
| Software Dependencies | No | The paper mentions "Apache mxnet", "pandas dataframe" and "sklearn" but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | No | All hyperparameters and neural architectures are optimized using random search (Bergstra and Bengio, 2012), which can be constrained to a specified time limit. For DataWig, the `SimpleImputer.complete` function with random search for hyperparameter tuning was used. For each baseline method, grid search was performed for hyperparameter optimization on a validation set; test errors were obtained on a separate test set (for details and unnormalized results see the benchmarks GitHub repository). Iterative imputation here means that 10 consecutive imputation rounds were performed for replacing the missing values in the input columns. |
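The mode-imputation baseline that DataWig is compared against above can be sketched in a few lines of plain pandas. This is a minimal illustration on hypothetical toy data, not the paper's benchmark code:

```python
import pandas as pd

# Hypothetical toy table with missing categorical values.
table = pd.DataFrame({
    "color": ["red", "blue", None, "red", None],
    "size":  ["S", "M", "L", "S", "M"],
})

# Mode imputation: fill every missing entry in a column with that
# column's most frequent observed value ("red" here).
mode_value = table["color"].mode().iloc[0]
imputed = table.assign(color=table["color"].fillna(mode_value))
```

Unlike DataWig's `SimpleImputer`, which learns to predict the missing value from the other columns, mode imputation ignores the rest of the row entirely, which is why it serves as the weakest reference point in the paper's comparison.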