Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
Authors: Riccardo Cappuzzo, Aimee Coelho, Félix Lefebvre, Paolo Papotti, Gaël Varoquaux
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving candidate tables to join, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, AutoML, and learning in data lakes. We analyze various methods in the four tasks, with an exhaustive empirical evaluation that required approximately 21 years, or 189k CPU and GPU hours (Table 7). |
| Researcher Affiliation | Collaboration | Riccardo Cappuzzo (EMAIL, SODA Team, Inria Saclay); Aimee Coelho (EMAIL, Dataiku Paris); Félix Lefebvre (EMAIL, SODA Team, Inria Saclay); Paolo Papotti (EMAIL, EURECOM Biot); Gaël Varoquaux (EMAIL, SODA Team, Inria Saclay) |
| Pseudocode | Yes | Algorithm 1 Pseudocode of the Stepwise Greedy Join selector. |
| Open Source Code | Yes | YADL, the base tables, and the pipeline are available and easily extendable to spur further research. [...] The code to prepare YADL is at https://github.com/soda-inria/YADL and the pipeline is at https://github.com/soda-inria/retrieve-merge-predict. |
| Open Datasets | Yes | We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. [...] The data lakes are available at https://doi.org/10.5281/zenodo.10600047, the code to prepare YADL is at https://github.com/soda-inria/YADL and the pipeline is at https://github.com/soda-inria/retrieve-merge-predict. |
| Dataset Splits | Yes | To ensure reliable prediction results, a cross-validation setup repeats these steps over different train-test splits [...] Each selector receives as input a pool of K candidates from a given retrieval method on a given data lake, the train and test splits of the base table, and the aggregation method to use. The train split is further split into a training (0.8) and validation set (0.2). [...] Further detail on implementation details, pre-processing of the data, and cross-validation setup are provided in Appendix C.2. |
| Hardware Specification | No | We run our experimental campaign on a SLURM cluster, fixing the number of threads to 32. Nodes have at least 256GB of RAM. Experiments that involved NNs were run on nodes equipped with GPUs. |
| Software Dependencies | Yes | The implementation of the pipeline is in Python; Exact Matching, aggregation and join operations are implemented using Polars (Vink et al., 2024) as backend. [...] We rely on the Python implementation of MinHash provided by the Datasketch package (Zhu et al., 2023). [...] From references: "Ritchie Vink...pola-rs/polars: Python polars 0.20.6, January 2024." and "Eric Zhu...ekzhu/datasketch: v1.5.9, February 2023." |
| Experiment Setup | Yes | We fix the number of CatBoost iterations to 300; we stop training the model 10 iterations after the optimal metric has been detected; we set the L2 regularization coefficient to 0.01. We use the default parameters for RidgeCV as used in the scikit-learn implementation. For RealMLP and ResNet we use the parameters that are set in the pytabkit package as they have been shown in Holzmüller et al. (2024) to be the better defaults. We fix the number of Stepwise Greedy Join iterations to 30: this number is consistent with the number of candidates that are provided in the retrieval step. We use a containment threshold of 0.2 for the preparation of the MinHash index, and clamp the number of candidates returned by each retrieval method to 30 (except when specified otherwise). |
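
The Dataset Splits row reports that the train portion of each cross-validation fold is further split 0.8/0.2 into training and validation sets. A minimal stdlib-only sketch of such a split (the function name and use of `random` are illustrative assumptions, not the authors' implementation):

```python
import random

def train_valid_split(rows, valid_frac=0.2, seed=0):
    """Shuffle row indices and split into train/validation parts.

    Illustrative helper: the paper splits the train portion of each
    cross-validation fold into training (0.8) and validation (0.2).
    """
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_valid = int(len(rows) * valid_frac)
    valid = [rows[i] for i in idx[:n_valid]]
    train = [rows[i] for i in idx[n_valid:]]
    return train, valid

train, valid = train_valid_split(list(range(100)))
print(len(train), len(valid))  # 80 20
```

In the paper's pipeline this split is repeated over different train-test folds to stabilize the prediction results.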
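
The retrieval step indexes candidate join columns with Datasketch's MinHash and keeps candidates whose containment with the query column is at least 0.2, clamped to 30 results. The sketch below computes *exact* containment with plain Python sets rather than the MinHash approximation the paper uses; function names and the toy lake are assumptions for illustration:

```python
def containment(query_col, candidate_col):
    """Fraction of the query column's distinct values found in the candidate column."""
    q, c = set(query_col), set(candidate_col)
    return len(q & c) / len(q) if q else 0.0

def retrieve_candidates(query_col, lake, threshold=0.2, top_k=30):
    """Rank data-lake columns by containment, keeping those above the threshold."""
    scored = [(name, containment(query_col, col)) for name, col in lake.items()]
    scored = [(name, s) for name, s in scored if s >= threshold]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]

# Toy "data lake" of candidate join columns.
lake = {
    "countries": ["FR", "IT", "DE", "US"],
    "colors": ["red", "blue"],
}
print(retrieve_candidates(["FR", "US", "JP"], lake))
```

MinHash trades the exact set intersection above for constant-size sketches, which is what makes indexing a full data lake tractable.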
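
The Experiment Setup row can be summarized as a configuration fragment. The CatBoost keys below (`iterations`, `od_wait`, `l2_leaf_reg`) are real CatBoost parameter names, but mapping the paper's description onto them, and the constant names, are assumptions:

```python
# CatBoost hyperparameters reported in the paper, expressed as keyword
# arguments one could pass to catboost.CatBoostRegressor/CatBoostClassifier
# (the exact call site in the authors' pipeline is an assumption).
catboost_params = {
    "iterations": 300,    # fixed number of boosting iterations
    "od_wait": 10,        # stop 10 iterations after the best metric (overfitting detector)
    "l2_leaf_reg": 0.01,  # L2 regularization coefficient
}

# Other reported pipeline settings.
MINHASH_CONTAINMENT_THRESHOLD = 0.2  # for building the MinHash index
RETRIEVAL_TOP_K = 30                 # candidates returned per retrieval method
GREEDY_JOIN_ITERATIONS = 30          # Stepwise Greedy Join iterations
```

RidgeCV, RealMLP, and ResNet use library defaults (scikit-learn and pytabkit respectively), so they need no extra configuration here.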