TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks
Authors: Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we introduce TabReD, a collection of eight industry-grade tabular datasets. We reassess a large number of tabular ML models and techniques on TabReD. We demonstrate that evaluation on both time-based data splits and richer feature sets leads to different methods ranking, compared to evaluation on random splits and smaller number of features, which are common in academic benchmarks. |
| Researcher Affiliation | Collaboration | Ivan Rubachev (1,2), Nikolay Kartashev (2,1), Yury Gorishniy (1), Artem Babenko (1,2); 1: Yandex, 2: HSE University |
| Pseudocode | No | The paper describes methodologies and experimental setups but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT We describe our experimental setup in subsection 5.1 and Appendix C. The code is available at https://github.com/yandex-research/tabred |
| Open Datasets | Yes | To this end, we introduce TabReD, a collection of eight industry-grade tabular datasets. We summarize the main information about our benchmark in Table 2. The complete description of each included dataset can be found in Appendix B. Newly introduced datasets are available at https://kaggle.com/TabReD |
| Dataset Splits | Yes | All TabReD datasets come with time-based splits into train, validation and test parts. Furthermore, because of additional investments in data acquisition and feature engineering, all datasets in TabReD have more features. This stems from adopting the preprocessing steps from production ML pipelines and Kaggle competition forums, where extensive data engineering is often highly prioritized. ... First, we use random splits instead of the time-based ones. We keep train, validation and test set sizes the same and randomly shuffle objects to obtain the random splits. |
| Hardware Specification | No | The paper mentions general computing concepts like 'compute requirements' but does not specify any particular hardware components such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | We tune hyperparameters for most methods using Optuna from Akiba et al. (2019), for DL models we use the AdamW optimizer and optimize MSE loss or binary cross entropy depending on the dataset. ... We include three main implementations of Gradient Boosted Decision Trees: XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017) and CatBoost (Prokhorenkova et al., 2018)... The paper mentions software and libraries by name, but without specific version numbers. |
| Experiment Setup | Yes | We adopt training, evaluation and tuning setup from Gorishniy et al. (2024). We tune hyperparameters for most methods using Optuna from Akiba et al. (2019), for DL models we use the AdamW optimizer and optimize MSE loss or binary cross entropy depending on the dataset. By default, each dataset is temporally split into train, validation and test sets. Each model is selected by the performance on the validation set and evaluated on the test set (both for hyperparameter tuning and early-stopping). Test set results are aggregated over 15 random seeds for all methods, and the standard deviations are taken into account to ensure the differences are statistically significant. We randomly subsample large datasets (Homecredit Default, Cooking Time, Delivery ETA and Weather) to make more extensive hyperparameter tuning feasible. For extended description of our experimental setup including data preprocessing, dataset statistics, statistical testing procedures and exact tuning hyperparameter spaces, see Appendix C. We run hyperparameter optimization for 100 iterations for most models, the exceptions are FT-Transformer (which is significantly less efficient on datasets with hundreds of features) where we were able to run 25. |
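The contrast the paper draws between time-based and random splits can be sketched as below. This is an illustrative reimplementation, not code from the TabReD repository; the split fractions are placeholders, as the paper's actual sizes are dataset-specific (Appendix C). The key property is that every training example precedes every test example in time, while the random split keeps the same part sizes but shuffles objects.

```python
import numpy as np

def time_based_split(timestamps, train_frac=0.7, val_frac=0.15):
    """Split row indices chronologically: earliest rows go to train,
    latest to test. `timestamps` is a 1-D array of event times."""
    order = np.argsort(timestamps)
    n = len(order)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

def random_split(n, train_frac=0.7, val_frac=0.15, seed=0):
    """Baseline random split with the same part sizes, mirroring the
    paper's ablation that shuffles objects to obtain random splits."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])
```

Under the time-based split, the train part's latest timestamp never exceeds the test part's earliest one, which is exactly the temporal-shift condition the random split destroys.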
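The tuning protocol quoted above (run ~100 trials, select each model by its validation metric) can be illustrated with a minimal random-search stand-in. The paper uses Optuna's sampler, not random search, and `tune` and `train_eval` are hypothetical names invented for this sketch; only the select-on-validation logic reflects the described setup.

```python
import random

def tune(train_eval, search_space, n_trials=100, seed=0):
    """Minimal random-search stand-in for an Optuna-style loop.

    `train_eval(params)` trains a model and returns its validation
    metric (lower is better, e.g. validation MSE). The best trial's
    parameters are kept, mirroring "each model is selected by the
    performance on the validation set".
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Sample each hyperparameter uniformly from its (low, high) range.
        params = {name: rng.uniform(lo, hi)
                  for name, (lo, hi) in search_space.items()}
        score = train_eval(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Usage with a toy objective whose optimum is at lr = 0.5:
space = {"lr": (0.0, 1.0)}
best, score = tune(lambda p: (p["lr"] - 0.5) ** 2, space, n_trials=200)
```

In the paper's setup the trial budget is 100 for most models and 25 for FT-Transformer, and the final numbers are aggregated over 15 seeds rather than taken from a single run.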