Mixture of In-Context Prompters for Tabular PFNs
Authors: Derek Xu, Olcay Cirit, Reza Asadi, Yizhou Sun, Wei Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show MIXTUREPFN outperforms 19 baselines both in mean rank and as the Condorcet winner across 36 diverse tabular datasets. We evaluate MIXTUREPFN on the recently proposed TABZILLA benchmark (McElfresh et al., 2023). TABZILLA is the largest tabular benchmark, containing the 36 hardest of 176 tabular classification datasets and 19 baseline algorithms, covering both deep learning and GBDTs. |
| Researcher Affiliation | Collaboration | Derek Xu (University of California, Los Angeles); Olcay Cirit (Uber Technologies Inc); Reza Asadi (Uber Technologies Inc); Yizhou Sun (University of California, Los Angeles); Wei Wang (University of California, Los Angeles) |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. Figures 1 and 6 are illustrations of the model architecture and its differences from prior work, not pseudocode. |
| Open Source Code | Yes | MIXTUREPFN both achieves the highest mean rank with statistical significance and is the Condorcet winner across 36 diverse tabular datasets against 19 strong deep learning and tree-based baselines. We will release our code on Github. |
| Open Datasets | Yes | We evaluate MIXTUREPFN on the recently proposed TABZILLA benchmark (McElfresh et al., 2023). TABZILLA is the largest tabular benchmark, containing the 36 hardest of 176 tabular classification datasets and 19 baseline algorithms, covering both deep learning and GBDTs. All results were collected over 10 folds following TABZILLA (McElfresh et al., 2023) and OpenML. |
| Dataset Splits | Yes | Our dataset is split into train/dev/test sets. During hyperparameter tuning, decoder tokens are taken from the dev set instead. We randomly split the bootstrapped dataset D_bootstrap ∼ p(D\|D_train) into train/test splits to obtain the labelled prompt (X_subtest, Y_subtest, D_subtrain). In this case, we sample from p(D\|D_train) by randomly sampling 90% of training samples without replacement to obtain D_subtrain and treating the remaining 10% of samples as (X_subtest, Y_subtest). All results were collected over 10 folds following TABZILLA (McElfresh et al., 2023) and OpenML. We tune the hyperparameters by splitting the train set of each fold into training and validation following TABZILLA (McElfresh et al., 2023). |
| Hardware Specification | Yes | All experiments were conducted on an Nvidia V100 GPU and an AMD EPYC 7402 CPU. |
| Software Dependencies | No | The paper mentions using the FAISS library for the router and improving TABPFN with Flash Attention and torch.compile, but it does not specify version numbers for these software components or for other key libraries such as Python or PyTorch. |
| Experiment Setup | Yes | As TABPFN transformers can handle up to 3,000 training samples, we set B = 3,000. We empirically found the minimum number of iterations and batch size required for loss convergence on the artificial-characters dataset to be 128 iterations and Nbatch = 64, which we set for all other datasets. During inference, we use a larger batch size, Nbatch = 1024, as gradients no longer need to be stored. We finetuned the model using the Adam optimizer with a learning rate of 0.001. As TABPFN transformers can handle up to 100 features, for datasets with over 100 features and TABPFN-based models, we use Maximum Relevance and Minimum Redundancy (mRMR) feature selection (Ding & Peng, 2005) to reduce the number of features to 100. We follow the TABZILLA benchmark, setting Nensemble = 16, which shuffles features Nensemble/2 times each for the original features and the power-law-scaled features. Due to the large variability in datasets in the TABZILLA benchmark, we try 4 hyperparameter settings: (1) γ = 5.0, (2) γ = 1.0, (3) γ = 1.0 but mRMR with 50 features instead of 100 features for feature-count scalability, and (4) γ = 1.0 but with CatBoost instead of Ordinal encoding for categorical-feature scalability (Hollmann et al., 2022). Hyperparameters are chosen by picking the setting which maximizes performance on the validation set. |
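
The bootstrap prompt construction described in the Dataset Splits row (sample 90% of training rows without replacement as D_subtrain, use the remaining 10% as the held-out query set) can be sketched as below. This is a minimal NumPy sketch of that split, not the authors' code; the function name and signature are assumptions for illustration.

```python
import numpy as np

def sample_bootstrap_prompt(X_train, y_train, rng=None, train_frac=0.9):
    """Draw one labelled prompt from p(D | D_train) as described in the
    Dataset Splits row: train_frac (90%) of training samples, chosen
    without replacement, form D_subtrain; the rest form the query set
    (X_subtest, y_subtest)."""
    rng = np.random.default_rng(rng)
    n = len(X_train)
    perm = rng.permutation(n)          # sampling without replacement
    cut = int(train_frac * n)
    train_idx, test_idx = perm[:cut], perm[cut:]
    D_subtrain = (X_train[train_idx], y_train[train_idx])
    return X_train[test_idx], y_train[test_idx], D_subtrain
```

Repeating this draw per finetuning iteration yields fresh prompts from the same training set, which is what distinguishes the bootstrapped prompt distribution from a single fixed train/dev split.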
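
The Experiment Setup row's ensembling (Nensemble = 16, with Nensemble/2 random feature shuffles applied to both the original and the power-law-scaled features) can be sketched as follows. This is an illustrative sketch only: the exact power-law transform is not specified in the table, so a simple signed square-root stands in for it, and the function name is hypothetical.

```python
import numpy as np

def build_ensemble_inputs(X, n_ensemble=16, rng=None):
    """Produce n_ensemble views of X per the TABZILLA-style setup:
    n_ensemble/2 random feature permutations of the original features,
    plus n_ensemble/2 permutations of power-law-scaled features
    (signed square root used here as an assumed placeholder)."""
    rng = np.random.default_rng(rng)
    n_feat = X.shape[1]
    X_power = np.sign(X) * np.abs(X) ** 0.5  # placeholder power-law scaling
    views = []
    for base in (X, X_power):
        for _ in range(n_ensemble // 2):
            perm = rng.permutation(n_feat)   # shuffle feature order
            views.append(base[:, perm])
    return views
```

Each view would be passed through the PFN separately and the predictions averaged; feature-order shuffling matters because TABPFN-style transformers are not invariant to column order.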