Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

TabFSBench: Tabular Benchmark for Feature Shifts in Open Environments

Authors: Zi-Jian Cheng, Ziyi Jia, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations.
Researcher Affiliation Academia 1School of Intelligence Science and Technology, Nanjing University, China 2National Key Laboratory for Novel Software Technology, Nanjing University, China 3School of Artificial Intelligence, Nanjing University, China. Correspondence to: Yu-Feng Li <EMAIL>, Lan-Zhe Guo <EMAIL>.
Pseudocode No The paper includes a 'LLM Prompt for the experiments' in Figure 7, which shows a structured input example for an LLM. However, it is not presented as an algorithm block or pseudocode for a specific method or procedure. There are no explicitly labeled 'Algorithm' or 'Pseudocode' sections.
Open Source Code Yes The benchmark code for this paper is available at https://github.com/LAMDASZ-ML/TabFSBench.
Open Datasets Yes To effectively reproduce feature-shift scenarios, we select open-source and reliable datasets from OpenML and Kaggle's extensive dataset library, including three curated tasks of binary classification, multi-class classification, and regression, covering various domains such as finance, healthcare, and geology. The primary attributes of the datasets used in TabFSBench are presented in Table 1. Detailed information on the datasets can be found in Appendix D.
Dataset Splits No The paper states: 'We begin by partitioning the dataset into a train&validation set and a set of test sets. Appendix D shows the segmentation details of each dataset, including Pearson correlation heat maps of datasets.' However, Appendix D provides information about the datasets and their Pearson correlation analysis, but it does not specify explicit percentages or sample counts for the train/validation/test splits, nor does it refer to predefined standard splits with sufficient detail to reproduce the exact partitioning.
Hardware Specification Yes The deep learning models, LLMs, and Tabular LLMs were trained on an NVIDIA A800 GPU. Gradient-boosted tree models, where applicable, were trained on a CPU rather than a GPU, using an AMD Ryzen 5 7500F 6-Core Processor.
Software Dependencies No The paper mentions using the Optuna framework for hyperparameter optimization and lists various models (e.g., LightGBM, XGBoost, CatBoost, Llama3-8B) and tabular deep-learning models. However, it does not specify version numbers for any of these software components, which is required for a reproducible description of ancillary software.
Experiment Setup Yes Hyperparameter Optimization. We use hyperparameter optimization to help models achieve optimal performance in different datasets. In Appendix E, we provide full hyperparameter grids for each model.