LLM-Guided Self-Supervised Tabular Learning With Task-Specific Pre-text Tasks
Authors: Sungwon Han, Seungeon Lee, Meeyoung Cha, Sercan Ö. Arik, Jinsung Yoon
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TST-LLM consistently outperforms contemporary baselines with win ratios of 95% and 81% when applied to 22 benchmark tabular datasets, including binary and multi-class classification and regression tasks. (Section 4, Experiment:) We evaluate TST-LLM across multiple tabular datasets with various downstream tasks. Through our experiments, we discuss which components of the model contributed to performance enhancements and how our model operates. |
| Researcher Affiliation | Collaboration | Sungwon Han, Korea Advanced Institute of Science and Technology (KAIST); Seungeon Lee, Korea Advanced Institute of Science and Technology (KAIST); Meeyoung Cha, Max Planck Institute for Security and Privacy; Sercan Ö. Arik, Google Cloud AI; Jinsung Yoon, Google Cloud AI |
| Pseudocode | Yes | Algorithm 1: Algorithm for TST-LLM. Input: Original dataset D, large language model backbone LLM, encoder f, original input feature set Y, number of features to select M, entropy threshold t_ent, meta-information E_task, E_name, and E_desc. Output: Trained encoder f. ... Algorithm 2: Algorithm for feature selection with minimum redundancy. |
| Open Source Code | Yes | Our code is available on Github1. 1https://github.com/Sungwon-Han/TST-LLM |
| Open Datasets | Yes | Adult (Asuncion & Newman, 2007), Balance-scale (Siegler, 1994), Bank (Moro et al., 2014), Blood (Yeh et al., 2009), Car (Kadra et al., 2021), Communities (Redmond, 2009), Credit-g (Kadra et al., 2021), Diabetes (Smith et al., 1988), Eucalyptus (Bulloch et al., 1991), Forest-fires (Cortez & Morais, 2008), Heart (fedesoriano, 2021), Junglechess (van Rijn & Vis, 2014), Myocardial (Golovenkin et al., 2020), Tic-tac-toe (Aha, 1991), Vehicle (Mowforth & Shepherd), Bike (Fanaee-T, 2013), Crab (Sidhu, 2021), Housing (Pace & Barry, 1997), Insurance (Datta, 2020), Wine (Cortez & Reis, 2009), Sequence-type, and Solution-mix. Descriptive statistics and task descriptions for each dataset are available in the Appendix A.1 and B. |
| Dataset Splits | No | The paper lists multiple benchmark datasets, but does not explicitly state the train/test/validation split ratios or methodology used for these datasets in the main text or appendices. It mentions "Experiments were run with 3 different random seeds, and the average values were reported" but this does not specify data splits. |
| Hardware Specification | Yes | The comparison was conducted on the Adult dataset using a single A100 GPU. |
| Software Dependencies | No | The paper mentions using GPT-3.5 as the LLM backbone and Adam optimizer, but does not specify version numbers for any software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | During LLM generation, the temperature was set to 0.5 and the top-p value was set to the API's default of 1. The discovery process generated five features per trial, with the number of trials set at 40. [...] The number of selected features M was set to 20. [...] Training utilized the Adam optimizer with a learning rate of 1e-4, a batch size of 128, and 1000 training iterations. |
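The table quotes Algorithm 2 ("feature selection with minimum redundancy") by name only. A minimal greedy sketch of that idea, assuming a simple absolute-Pearson-correlation redundancy measure (the paper's actual relevance/redundancy scoring is not reproduced in this report, and the seed choice below is arbitrary):

```python
import numpy as np

def select_min_redundancy(features: np.ndarray, m: int) -> list[int]:
    """Greedily pick m column indices, each step choosing the candidate
    least correlated (on average) with the already-selected set.
    Hypothetical sketch of a minimum-redundancy selector; not the
    paper's exact Algorithm 2."""
    n_feat = features.shape[1]
    # Absolute pairwise Pearson correlation as the redundancy measure.
    corr = np.abs(np.corrcoef(features, rowvar=False))
    selected = [0]  # arbitrary starting feature
    while len(selected) < m:
        candidates = [j for j in range(n_feat) if j not in selected]
        # Mean correlation of each candidate to the selected set.
        redundancy = [corr[j, selected].mean() for j in candidates]
        selected.append(candidates[int(np.argmin(redundancy))])
    return selected
```

With M = 20 as in the quoted setup, this would reduce an LLM-generated candidate pool to 20 columns before self-supervised training.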