Table Foundation Models: on knowledge pre-training for tabular learning
Authors: Myung Jun Kim, Félix Lefebvre, Gaëtan Brison, Alexandre Perez-Lebel, Gaël Varoquaux
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4, titled "Empirical study: TARTE improves prediction and speed, and is reusable", details the experimental setup, datasets, methods, and presents results using learning curves, critical difference diagrams, and Pareto diagrams. This includes evaluation on 40 regression and 11 classification datasets, comparison against baselines, and analysis of prediction scores and runtime costs. |
| Researcher Affiliation | Collaboration | The authors are affiliated with "Inria Saclay" (academic research institute), "Institut Polytechnique de Paris" and "New York University" (academic institutions), and also "Fundamental Technologies" and "Probabl.ai" (private companies). This mix indicates a collaboration between academia and industry. |
| Pseudocode | No | The paper describes the architecture and pre-training process in text and with diagrams (Figure 1 and Figure 2), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation of TARTE is available at https://github.com/soda-inria/tarte-ai. |
| Open Datasets | Yes | We use the benchmark from Kim et al. (2024): 40 regression and 11 classification datasets. These datasets are available at https://huggingface.co/datasets/inria-soda/carte-benchmark. Additionally, TARTE expands the coverage of pre-training data by combining two large knowledge bases, YAGO4.5 (Suchanek et al., 2024) and Wikidata (Vrandečić & Krötzsch, 2014). |
| Dataset Splits | Yes | For evaluation on single tables, the train size for each table varied over 32, 64, 128, 256, 512, 1 024, and 10 000 samples; the remaining data was used as the test set. Overall, results were recorded on 10 different train/test splits for each dataset. For fine-tuned models, multiple models are trained on different train/validation splits, with the validation set used for early stopping. |
| Hardware Specification | Yes | For the pre-training of TARTE, a single NVIDIA A40 (48GB) GPU was used. The downstream experiments were run on 32 CPU cores, except for the TabPFNv2 variants (at all n) and for TARTE FT and CARTE FT at n = 10 000, which used GPUs. The hardware was chosen based on availability. GPUs: NVIDIA V100 (32GB VRAM), A40 (40GB / 48GB VRAM). CPUs: AMD EPYC 7742 64-Core Processor, AMD EPYC 7702 64-Core Processor (512GB RAM), Intel(R) Xeon(R) CPU E5-2660 v2, Intel(R) Xeon(R) Gold 6226R CPU (256GB RAM). |
| Software Dependencies | No | The paper mentions several software components, such as fastText, skrub, scikit-learn, XGBoost, and CatBoost. However, it does not provide specific version numbers for any of these dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | The transformer architecture of TARTE is specified as follows: three self-attention layers with 24 attention heads, a hidden dimension of 768, and a feed-forward dimension of 2 048 per layer. For the projection layers for contrastive learning, two linear layers with hidden and output dimensions of 2 048 and 768, respectively. The resulting model contains over 25 million trainable parameters. The batch size is set to 512... The total number of training steps is 200 000, with the AdamW optimizer and a cosine scheduler. The learning rates were set as lr_min = 10^-8 and lr_max = 10^-6, with warm-up over the first 2 000 steps, followed by a linear decay schedule. The probability for all dropout layers was set to 0.1. Table 3 shows the hyperparameter spaces for each method. |
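The learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration using the reported values (lr_min = 10^-8, lr_max = 10^-6, 2 000 warm-up steps, 200 000 total steps); the function name is my own, the paper mentions both a cosine scheduler and a linear decay, and this sketch implements the warm-up-plus-linear-decay reading, with the decay endpoint (back to lr_min) as an assumption:

```python
def warmup_linear_decay_lr(
    step: int,
    lr_min: float = 1e-8,
    lr_max: float = 1e-6,
    warmup_steps: int = 2_000,
    total_steps: int = 200_000,
) -> float:
    """Linear warm-up from lr_min to lr_max, then linear decay back down.

    Values match those reported in the paper; the decay target (lr_min)
    is an assumption, as the paper does not state the final learning rate.
    """
    if step < warmup_steps:
        # Linear warm-up over the first 2 000 steps.
        frac = step / warmup_steps
        return lr_min + frac * (lr_max - lr_min)
    # Linear decay over the remaining 198 000 steps.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_max - frac * (lr_max - lr_min)
```

Under this reading, the rate peaks at lr_max exactly at step 2 000 and decreases linearly for the rest of training.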