TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
Authors: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On 53 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data. We evaluate TabICL on the TALENT benchmark (Ye et al., 2025), comprising 200 classification datasets across various domains and sizes (up to 150K samples). |
| Researcher Affiliation | Academia | ¹SODA team, INRIA Saclay, France; ²Sierra team, INRIA Paris, France; ³École Normale Supérieure, PSL Research University, Paris, France. Correspondence to: Jingang Qu <EMAIL>. |
| Pseudocode | No | The paper describes methods through textual descriptions, mathematical equations, and architectural diagrams (e.g., Figure 1, Figure 2) rather than explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Pretraining code, inference code, and pre-trained models are available at https://github.com/soda-inria/tabicl. |
| Open Datasets | Yes | We evaluate TabICL on the TALENT benchmark (Ye et al., 2025), comprising 200 classification datasets across various domains and sizes (up to 150K samples). |
| Dataset Splits | Yes | Datasets are split into 64% training, 16% validation, and 20% test data. |
| Hardware Specification | Yes | The pretraining took 20 days on three A100 GPUs with 40GB memory using PyTorch (16, 3, and 1 days for stage 1, 2, and 3, respectively). |
| Software Dependencies | No | The paper mentions "PyTorch" as the framework used for pretraining, but does not specify a version number. Other software like XGBoost is mentioned for synthetic data generation but not as a core dependency for the described methodology with a specific version. |
| Experiment Setup | Yes | We employed a three-stage procedure: 1. NB = 4 with a fixed size of 1,024 for 160K steps; 2. NB = 1 with the size randomly drawn from a log-uniform distribution between 1K and 40K over 2K steps... 3. NB = 1 with the size uniformly sampled between 40K and 60K for 50 steps... We use Adam (Kingma & Ba, 2014) and clip the gradient norm to 1. The learning rate schedules for pretraining are shown in Figure E.1, including: cosine decay with restarts for stage 1; polynomial decay for stage 2, where the learning rate is given by lr = (lr_init − lr_end) · (1 − step/T)² + lr_end, with lr_init = 2e-5, lr_end = 5e-6... Flash Attention and automatic mixed precision are applied globally. |
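The 64%/16%/20% split reported above can be realized as two successive 80/20 splits (80% × 80% = 64% train, 80% × 20% = 16% validation, 20% test). A minimal sketch of index-level splitting, assuming a simple random permutation (the paper does not specify the splitting procedure beyond the proportions):

```python
# Sketch of a 64/16/20 train/validation/test split over row indices.
# The function name and seed are illustrative assumptions, not the paper's code.
import numpy as np

def split_indices(n, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(0.20 * n))   # 20% test
    n_val = int(round(0.16 * n))    # 16% validation
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]    # remaining 64% training
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # → 640 160 200
```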