Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer
Authors: Yulun Wu, Doron L Bergman
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Capital One. Correspondence to: Yulun Wu <yulun EMAIL>. |
| Pseudocode | No | The paper describes methods and architectures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The APT model that we released and used for evaluations is the last model after the final iteration of pre-training. For more details on the data generator hyperparameters, see the code repository in our supplementary material. |
| Open Datasets | Yes | For classification, we used the curated open-source OpenML-CC18 dataset suite (Bischl et al., 2021) containing 68 popular tabular benchmark datasets (4 vision datasets mnist_784, CIFAR_10, Devnagari-Script, and Fashion-MNIST are not treated as tabular and removed from the total 72 datasets), and our main results are presented on all small datasets (number of samples no larger than 2,000) in OpenML-CC18... For regression benchmarking, we used the curated open-source OpenML-CTR23 dataset suite (Fischer et al., 2023). |
| Dataset Splits | Yes | The train-test split is set to 80-20 instead of the unconventional 50-50. For datasets with number of features larger than 100, we subsample 100 features similar to (McElfresh et al., 2024). Standard deviations are calculated across 5 different splits. |
| Hardware Specification | Yes | The average runtime of APT increased by 4.6% compared to TabPFN and remained within a second on GPU (NVIDIA H100), showing that neural modifications from the mixture block have not made APT significantly heavier. |
| Software Dependencies | No | The paper describes hyperparameters but does not explicitly list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | All common hyperparameters of APT are directly inherited from TabPFN and not tuned, including learning rate 1e-4, number of blocks 12, hidden dimensions 512, hidden feedforward dimensions 1024, number of heads 4, effective batch size (batch size per step × number of gradient accumulation steps) 64, total number of training datasets (number of epochs × steps per epoch × number of datasets per step) 6,400,000, as well as all data generator hyperparameters. The hyperparameter search space of benchmark models is directly inherited from Hollmann et al. (2022), and from McElfresh et al. (2024) if the benchmark model is not in Hollmann et al. (2022). |
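The split protocol quoted above (80-20 train-test split, standard deviations over 5 random splits, and subsampling to 100 features when a dataset has more) can be sketched as follows. This is a minimal illustration of that protocol, not the paper's released code; the function name and NumPy-based implementation are assumptions.

```python
# Illustrative sketch of the evaluation-split protocol described in the
# report: 80-20 train-test split, repeated over 5 random splits, with
# feature subsampling to 100 columns for wide datasets. Not the authors'
# code; names and details here are hypothetical.
import numpy as np

def make_splits(X, y, n_splits=5, test_frac=0.2, max_features=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Subsample 100 features when the dataset has more than 100.
    if d > max_features:
        cols = rng.choice(d, size=max_features, replace=False)
        X = X[:, cols]
    splits = []
    for _ in range(n_splits):
        # Fresh random permutation per split; metrics' standard
        # deviations are then computed across these 5 splits.
        perm = rng.permutation(n)
        n_test = int(round(n * test_frac))
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        splits.append((X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return splits

# Toy example: 50 samples, 150 features -> features reduced to 100,
# each split has 40 train and 10 test samples.
X = np.zeros((50, 150))
y = np.zeros(50)
splits = make_splits(X, y)
```

Subsampling features once per dataset (rather than per split) is one possible reading of the quoted description; the released repository referenced in the paper's supplementary material is the authoritative source.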