TabM: Advancing tabular deep learning with parameter-efficient ensembling
Authors: Yury Gorishniy, Akim Kotelnikov, Artem Babenko
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study highlights a major, yet so far overlooked opportunity for designing substantially better MLP-based tabular architectures. Namely, our new model TabM relies on efficient ensembling, where one TabM efficiently imitates an ensemble of MLPs and produces multiple predictions per object. Compared to a traditional deep ensemble, in TabM, the underlying implicit MLPs are trained simultaneously, and (by default) share most of their parameters, which results in significantly better performance and efficiency. Using TabM as a new baseline, we perform a large-scale evaluation of tabular DL architectures on public benchmarks in terms of both task performance and efficiency, which renders the landscape of tabular DL in a new light. Generally, we show that MLPs, including TabM, form a line of stronger and more practical models compared to attention- and retrieval-based architectures. In particular, we find that TabM demonstrates the best performance among tabular DL models. Then, we conduct an empirical analysis on the ensemble-like nature of TabM. |
| Researcher Affiliation | Collaboration | Yury Gorishniy (Yandex); Akim Kotelnikov (HSE University, Yandex); Artem Babenko (Yandex) |
| Pseudocode | No | The paper describes the architecture and methods in detail using prose and mathematical notation (e.g., in Section 3 and 3.2), but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is available at: https://github.com/yandex-research/tabm. |
| Open Datasets | Yes | Our benchmark consists of 46 publicly available datasets used in prior work, including Grinsztajn et al. (2022); Gorishniy et al. (2024); Rubachev et al. (2024). The main properties of our benchmark are summarized in Table 1, and more details are provided in Appendix C. |
| Dataset Splits | Yes | Domain-aware splits. We pay extra attention to datasets with what we call domain-aware splits, including the eight datasets from the TabReD benchmark (Rubachev et al., 2024) and the Microsoft dataset (Qin & Liu, 2013). For these datasets, their original real-world splits are available, e.g. time-aware splits as in TabReD. ... The random splits of the remaining 37 datasets are inherited from prior work. |
| Hardware Specification | Yes | Most of the experiments were conducted on a single NVIDIA A100 GPU. In rare exceptions, we used a machine with a single NVIDIA 2080 Ti GPU and Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz. |
| Software Dependencies | No | Additionally, in this section, we mark with the asterisk (*) the versions of TabM enhanced with two efficiency-related plugins available out-of-the-box in PyTorch (Paszke et al., 2019): the automatic mixed precision (AMP) and torch.compile (Ansel et al., 2024). ... For numerical features, by default, we used a slightly modified version of the quantile normalization from the Scikit-learn package (Pedregosa et al., 2011) (see the source code)... For DL-based algorithms, we minimize cross-entropy for classification problems and mean squared error for regression problems. We use the AdamW optimizer (Loshchilov & Hutter, 2019). ... In most cases, hyperparameter tuning is performed with the TPE sampler (typically, 50-100 iterations) from the Optuna package (Akiba et al., 2019). The paper mentions PyTorch, Scikit-learn, AdamW, and Optuna with citations but does not provide explicit version numbers for these software components. |
| Experiment Setup | Yes | Most importantly, on each dataset, a given model undergoes hyperparameter tuning on the validation set, then the tuned model is trained from scratch under multiple random seeds, and the test metric averaged over the random seeds becomes the final score of the model on the dataset. ... For DL-based algorithms, we minimize cross-entropy for classification problems and mean squared error for regression problems. We use the AdamW optimizer (Loshchilov & Hutter, 2019). We do not apply learning rate schedules. We do not use data augmentations. We apply global gradient clipping to 1.0. For each dataset, we used a predefined dataset-specific batch size. We continue training until there are patience consecutive epochs without improvements on the validation set; we set patience = 16 for the DL models. ... Hyperparameter tuning spaces for most models are provided in individual sections below (example for TabM: subsection D.9). |
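The abstract's core idea — one model that imitates an ensemble of MLPs, with the implicit members sharing most parameters and producing multiple predictions per object — can be sketched with a BatchEnsemble-style shared linear layer. Everything below (class names, dimensions, the ±1 initialization of the per-member scales) is an illustrative assumption for exposition, not the authors' implementation; the real code is in the linked repository.

```python
import torch
import torch.nn as nn


class EnsembleLinear(nn.Module):
    """A linear layer shared across k implicit ensemble members.

    The weight matrix is shared by all members; each member only owns cheap
    elementwise input/output scaling vectors (a BatchEnsemble-style trick).
    This is a hypothetical sketch of parameter-efficient ensembling, not the
    exact TabM layer.
    """

    def __init__(self, k: int, d_in: int, d_out: int):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)  # shared by all k members
        # Random-sign init so the members start out different from each other.
        self.r = nn.Parameter(torch.randint(0, 2, (k, d_in)).float() * 2 - 1)
        self.s = nn.Parameter(torch.randint(0, 2, (k, d_out)).float() * 2 - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, d_in) -> (batch, k, d_out)
        return self.shared(x * self.r) * self.s


class TabMLikeMLP(nn.Module):
    """An MLP producing k predictions per object in a single forward pass."""

    def __init__(self, k: int, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.k = k
        self.layers = nn.Sequential(
            EnsembleLinear(k, d_in, d_hidden),
            nn.ReLU(),
            EnsembleLinear(k, d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Replicate each object for every implicit member, then run all
        # members simultaneously (they are trained jointly, not one by one).
        x = x.unsqueeze(1).expand(-1, self.k, -1)  # (batch, k, d_in)
        return self.layers(x)                       # (batch, k, d_out)


model = TabMLikeMLP(k=8, d_in=16, d_hidden=64, d_out=1)
preds = model(torch.randn(32, 16))  # 8 predictions per object
final = preds.mean(dim=1)           # average members at inference time
```

Because the heavy weight matrices are shared, the k members cost little more than a single MLP in memory and compute, which matches the paper's claim of better efficiency than a traditional deep ensemble.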
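The training protocol quoted above (AdamW, no learning-rate schedule, global gradient clipping to 1.0, early stopping after 16 epochs without validation improvement) can be sketched as a minimal loop. The function and argument names (`train`, `train_loader`, `val_score`) and the learning rate are illustrative placeholders, not the authors' code.

```python
import torch
import torch.nn as nn


def train(model, train_loader, val_score, patience=16, max_epochs=1000):
    """Early-stopping loop following the paper's described setup: AdamW,
    no LR schedule, global gradient clipping to 1.0, and stopping after
    `patience` consecutive epochs without validation improvement.
    A sketch under assumed interfaces, not the authors' implementation.
    """
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr illustrative
    loss_fn = nn.MSELoss()  # MSE for regression; cross-entropy for classification
    best, bad_epochs = float("-inf"), 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            # Global gradient clipping to 1.0, as stated in the paper.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
        score = val_score(model)  # convention here: higher is better
        if score > best:
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # patience = 16 for DL models
                break
    return model
```

Per the evaluation protocol, this tuned training run would then be repeated under multiple random seeds and the test metric averaged to obtain the final score.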