TabPFN Unleashed: A Scalable and Effective Solution to Tabular Classification Problems

Authors: Siyang Liu, Han-Jia Ye

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on over 200 benchmark classification datasets demonstrate that BETA either outperforms or matches state-of-the-art methods. We revisit existing variants of TabPFN and observe that most approaches focus either on reducing bias or variance, often neglecting the need to address the other side, while also increasing inference overhead. To fill this gap, we propose BETA (Bagging and Encoder-based Fine-tuning for TabPFN Adaptation), a novel and effective method designed to minimize both bias and variance. To reduce bias, we introduce a lightweight encoder to better align downstream tasks with the pre-trained TabPFN.
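As described, BETA pairs an encoder (bias reduction) with bagging (variance reduction). The bagging half can be sketched as bootstrap resampling of the in-context training set, averaging predictions across draws. This is a generic illustration under assumed names (`bagged_predict`, `predict_fn`), not the authors' implementation:

```python
import numpy as np

def bagged_predict(predict_fn, x_train, y_train, x_test,
                   n_bags=16, context_size=1000, seed=0):
    """Average class probabilities over bootstrap-resampled contexts.

    predict_fn(x_ctx, y_ctx, x_test) stands in for a single TabPFN
    forward pass conditioned on the sampled context.
    """
    rng = np.random.default_rng(seed)
    n = len(x_train)
    probs = []
    for _ in range(n_bags):
        # Bootstrap sample of the training set as the in-context examples.
        idx = rng.choice(n, size=min(context_size, n), replace=True)
        probs.append(predict_fn(x_train[idx], y_train[idx], x_test))
    # Averaging over bags reduces the variance of the ensemble prediction.
    return np.mean(probs, axis=0)
```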
Researcher Affiliation Academia 1School of Artificial Intelligence, Nanjing University, China 2National Key Laboratory for Novel Software Technology, Nanjing University, China. Correspondence to: Han-Jia Ye <EMAIL>.
Pseudocode No The paper describes the methodology textually in Section 3 and its subsections, outlining the components and steps involved, but does not present a formal pseudocode or algorithm block.
Open Source Code Yes The code is available at https://github.com/LAMDA-Tabular/BETA.
Open Datasets Yes In our experiments, we evaluate BETA on one of the largest publicly available tabular benchmarks, TALENT (Ye et al., 2024b), which includes 120 binary classification datasets and 80 multi-class classification datasets. To assess the effectiveness of BETA on high-dimensional datasets, we conducted experiments on 20 datasets with extremely high feature dimensions, as detailed in Table 2. The generalization error of TabPFN and its variants on two real-world datasets: Adult (Barry & Ronny, 1996) and Bank (S. et al., 2014).
Dataset Splits Yes For the TALENT datasets, we follow the evaluation protocol from Gorishniy et al. (2021) and Ye et al. (2024b). Each dataset is randomly split into training, validation, and test sets with proportions of 64%, 16%, and 20%, respectively. For each dataset, we train each model using 15 different random seeds and calculate the average performance on the test set.
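The 64%/16%/20% split with 15 random seeds can be sketched as follows; `split_indices` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def split_indices(n_samples, seed):
    """Randomly split indices into 64% train, 16% validation, 20% test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(0.64 * n_samples)
    n_val = int(0.16 * n_samples)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]
    return train, val, test

# Per the protocol, each model is trained under 15 different random
# seeds and test performance is averaged across the resulting splits.
splits = [split_indices(1000, seed) for seed in range(15)]
```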
Hardware Specification Yes Most of the experiments were performed with four NVIDIA GeForce RTX 4090 GPUs, four NVIDIA RTX 6000 Ada GPUs, and eight NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies No The paper mentions specific tools and optimizers like Optuna (Akiba et al., 2019) and AdamW (Loshchilov & Hutter, 2019), but does not provide version numbers for core software dependencies such as programming languages or deep learning frameworks (e.g., Python, PyTorch).
Experiment Setup Yes For all experiments except for the ablation study, we set the context size to 1000 and fixed the number of bootstrap sampling iterations to 16. For the feature transformation encoder, the main structure is a two-layer MLP, with both the hidden and output dimensions set to 100, using the ReLU activation function. During fine-tuning, we use the pre-trained model checkpoint, and fine-tuning is performed using the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 0.003, weight decay of 1e-5, and a batch size of 1024.
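A minimal numpy sketch of the encoder architecture described above (two-layer MLP, hidden and output dimensions 100, ReLU), with the reported fine-tuning hyperparameters recorded as constants. Function and constant names are illustrative, not from the BETA codebase:

```python
import numpy as np

HIDDEN_DIM = 100   # hidden dimension of the encoder MLP (from the paper)
OUTPUT_DIM = 100   # output dimension of the encoder MLP (from the paper)

def init_encoder(in_dim, seed=0):
    """Initialize a two-layer MLP feature-transformation encoder."""
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((in_dim, HIDDEN_DIM)) * np.sqrt(2.0 / in_dim)
    w2 = rng.standard_normal((HIDDEN_DIM, OUTPUT_DIM)) * np.sqrt(2.0 / HIDDEN_DIM)
    return w1, w2

def encode(x, params):
    """Forward pass: Linear -> ReLU -> Linear."""
    w1, w2 = params
    h = np.maximum(x @ w1, 0.0)  # ReLU activation
    return h @ w2

# Fine-tuning hyperparameters reported in the paper (optimizer: AdamW).
LR, WEIGHT_DECAY, BATCH_SIZE = 3e-3, 1e-5, 1024
```

In a real setup the encoder output would be fed to the frozen or fine-tuned TabPFN backbone; this sketch only fixes the shapes and hyperparameters.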