Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer
Authors: Yulun Wu, Doron L Bergman
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Capital One. Correspondence to: Yulun Wu <yulun EMAIL>. |
| Pseudocode | No | The paper describes methods and architectures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The APT model that we released and used for evaluations is the last model after the final iteration of pre-training. For more details on the data generator hyperparameters, see the code repository in our supplementary material. |
| Open Datasets | Yes | For classification, we used the curated open-source OpenML-CC18 dataset suite (Bischl et al., 2021) containing 68 popular tabular benchmark datasets (4 vision datasets mnist_784, CIFAR_10, Devnagari-Script, and Fashion-MNIST are not treated as tabular and removed from the total 72 datasets), and our main results are presented on all small datasets (number of samples no larger than 2,000) in OpenML-CC18... For regression benchmarking, we used the curated open-source OpenML-CTR23 dataset suite (Fischer et al., 2023). |
| Dataset Splits | Yes | The train-test split is set to 80-20 instead of the unconventional 50-50. For datasets with number of features larger than 100, we subsample 100 features similar to (McElfresh et al., 2024). Standard deviations are calculated across 5 different splits. |
| Hardware Specification | Yes | The average runtime of APT increased by 4.6% compared to TabPFN and remained within a second on GPU (NVIDIA H100), showing that neural modifications from the mixture block have not made APT significantly heavier. |
| Software Dependencies | No | The paper describes hyperparameters but does not explicitly list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | All common hyperparameters of APT are directly inherited from TabPFN and not tuned, including learning rate 1e-4, number of blocks 12, hidden dimensions 512, hidden feedforward dimensions 1024, number of heads 4, effective batch size (batch size per step × number of gradient accumulation steps) 64, total number of training datasets (number of epochs × steps per epoch × number of datasets per step) 6,400,000, as well as all data generator hyperparameters. The hyperparameter search space of benchmark models is directly inherited from Hollmann et al. (2022), and from McElfresh et al. (2024) if the benchmark model is not in Hollmann et al. (2022). |
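The split protocol quoted above (80-20 train-test split, standard deviations over 5 random splits, and subsampling to 100 features when a dataset has more) can be sketched as follows. This is a minimal illustration of that protocol, not the paper's released code; the function name and NumPy-based implementation are assumptions.

```python
# Illustrative sketch of the evaluation-split protocol described in the
# report: 80-20 train-test split, repeated over 5 random splits, with
# feature subsampling to 100 columns for wide datasets. Not the authors'
# code; names and details here are hypothetical.
import numpy as np

def make_splits(X, y, n_splits=5, test_frac=0.2, max_features=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Subsample 100 features when the dataset has more than 100.
    if d > max_features:
        cols = rng.choice(d, size=max_features, replace=False)
        X = X[:, cols]
    splits = []
    for _ in range(n_splits):
        # Fresh random permutation per split; metrics' standard
        # deviations are then computed across these 5 splits.
        perm = rng.permutation(n)
        n_test = int(round(n * test_frac))
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        splits.append((X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return splits

# Toy example: 50 samples, 150 features -> features reduced to 100,
# each split has 40 train and 10 test samples.
X = np.zeros((50, 150))
y = np.zeros(50)
splits = make_splits(X, y)
```

Subsampling features once per dataset (rather than per split) is one possible reading of the quoted description; the released repository referenced in the paper's supplementary material is the authoritative source.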