TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
Authors: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On 53 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data. We evaluate TabICL on the TALENT benchmark (Ye et al., 2025), comprising 200 classification datasets across various domains and sizes (up to 150K samples). |
| Researcher Affiliation | Academia | ¹SODA team, INRIA Saclay, France; ²Sierra team, INRIA Paris, France; ³École Normale Supérieure, PSL Research University, Paris, France. Correspondence to: Jingang Qu <EMAIL>. |
| Pseudocode | No | The paper describes methods through textual descriptions, mathematical equations, and architectural diagrams (e.g., Figure 1, Figure 2) rather than explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Pretraining code, inference code, and pre-trained models are available at https://github.com/soda-inria/tabicl. |
| Open Datasets | Yes | We evaluate TabICL on the TALENT benchmark (Ye et al., 2025), comprising 200 classification datasets across various domains and sizes (up to 150K samples). |
| Dataset Splits | Yes | Datasets are split into 64% training, 16% validation, and 20% test data. |
| Hardware Specification | Yes | The pretraining took 20 days on three A100 GPUs with 40GB memory using PyTorch (16, 3, and 1 days for stage 1, 2, and 3, respectively). |
| Software Dependencies | No | The paper mentions "PyTorch" as the framework used for pretraining, but does not specify a version number. Other software like XGBoost is mentioned for synthetic data generation but not as a core dependency for the described methodology with a specific version. |
| Experiment Setup | Yes | We employed a three-stage procedure: 1. NB = 4 with a fixed size of 1,024 for 160K steps; 2. NB = 1 with the size randomly drawn from a log-uniform distribution between 1K and 40K over 2K steps... 3. NB = 1 with the size uniformly sampled between 40K and 60K for 50 steps... We use Adam (Kingma & Ba, 2014) and clip the gradient norm to 1. The learning rate schedules for pretraining are shown in Figure E.1, including: cosine decay with restarts for stage 1; polynomial decay for stage 2, where the learning rate is given by lr = (lr_init − lr_end) · (1 − step/T)² + lr_end, with lr_init = 2e-5, lr_end = 5e-6... Flash Attention and automatic mixed precision are applied globally. |
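The 64%/16%/20% split reported above can be realized as two successive 80/20 splits (80% × 80% = 64% train, 80% × 20% = 16% validation, 20% test). A minimal sketch of index-level splitting, assuming a simple random permutation (the paper does not specify the splitting procedure beyond the proportions):

```python
# Sketch of a 64/16/20 train/validation/test split over row indices.
# The function name and seed are illustrative assumptions, not the paper's code.
import numpy as np

def split_indices(n, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(0.20 * n))   # 20% test
    n_val = int(round(0.16 * n))    # 16% validation
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]    # remaining 64% training
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # → 640 160 200
```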