Geodesic Flow Kernels for Semi-Supervised Learning on Mixed-Variable Tabular Dataset

Authors: Yoontae Hwang, Yongjae Lee

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To rigorously evaluate GFTab, we curate a comprehensive set of 21 tabular datasets spanning various domains, sizes, and variable compositions. Our experimental results show that GFTab outperforms existing ML/DL models across many of these datasets, particularly in settings with limited labeled data.
Researcher Affiliation | Academia | 1 University of Oxford, 2 Ulsan National Institute of Science and Technology (UNIST); EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using natural language and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/Yoontae6719/Geodesic-Flow-Kernels-for-Semi-Supervised-Learning-on-Mixed-Variable-Tabular-Dataset
Open Datasets | Yes | We selected 21 datasets after carefully reviewing more than 4,000 datasets, including OpenML (3,953 datasets), AMLB (Gijsbers et al. 2022) (71 datasets), and (Grinsztajn, Oyallon, and Varoquaux 2022) (22 datasets).
Dataset Splits | No | The paper mentions evaluating GFTab under conditions of "20% labeled training data" and "10% labeled setting," but it does not specify the overall training, validation, and test splits for the datasets.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It refers to Appendix B for other settings, but Appendix B is not included in the provided text.
Software Dependencies | No | The paper mentions using XGBoost (Chen and Guestrin 2016), CatBoost (Prokhorenkova et al. 2018), and Optuna (Akiba et al. 2019) but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | GFTab is trained with a semi-supervised objective, minimizing L_GFTab = L_sim + β * L_ce. We compared the performance of the model across various ranges of β. As a result, β = 1.0 yielded the best balance.
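The quoted objective combines an unsupervised similarity term with a weighted cross-entropy term. A minimal sketch of that combination and of the β sweep described above, assuming the two loss terms are already computed as scalars (the function and variable names here are illustrative, not from the paper's code):

```python
def gftab_loss(l_sim: float, l_ce: float, beta: float = 1.0) -> float:
    """Combined semi-supervised objective: L_GFTab = L_sim + beta * L_ce.

    l_sim: similarity loss over all (labeled and unlabeled) samples.
    l_ce:  cross-entropy loss over the labeled subset only.
    beta:  trade-off weight; the paper reports beta = 1.0 as the best balance.
    """
    return l_sim + beta * l_ce

# Illustrative sweep over beta, mirroring the paper's ablation
# (loss values are made-up placeholders).
for beta in (0.1, 0.5, 1.0, 2.0):
    print(f"beta={beta}: L_GFTab={gftab_loss(0.8, 0.4, beta):.2f}")
```

In a real training loop this scalar would be built from tensor-valued losses and backpropagated; the sketch only shows how β trades off the two terms.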