Decision Tree Induction Through LLMs via Semantically-Aware Evolution

Authors: Tennison Liu, Nicolas Huynh, Mihaela van der Schaar

ICLR 2025

Reproducibility assessment (Variable — Result — Evidence from LLM response):
Research Type: Experimental. Evidence: "Empirically, we demonstrate across various benchmarks that LLEGO evolves superior-performing trees compared to existing tree induction methods, and exhibits significantly more efficient search performance compared to conventional GP approaches." "Empirically, on a wide range of classification and regression tabular benchmarks, we demonstrate that LLEGO significantly improves search efficiency and consistently evolves trees with superior generalization performance."
Researcher Affiliation: Academia. Evidence: "Tennison Liu, Nicolas Huynh & Mihaela van der Schaar, DAMTP, University of Cambridge, Cambridge, UK."
Pseudocode: No. The paper includes a "LLEGO Overview" diagram (Figure 1) illustrating the algorithm's flow, but it is a visual representation of the steps rather than a formal pseudocode block with structured variables, loops, and conditionals. The "END-TO-END ALGORITHM" in Section 3.4 describes the process in prose.
Open Source Code: Yes. Evidence: "We provide the code to reproduce our results at https://github.com/nicolashuynh/LLEGO and https://github.com/tennisonliu/LLEGO." The code is also available at the wider lab repository https://github.com/vanderschaarlab/LLEGO.
Open Datasets: Yes. Evidence: "We empirically evaluate LLEGO's ability to find performant decision trees for 12 open-source tabular datasets from OpenML curated benchmarks (Vanschoren et al., 2014), including 7 classification and 5 regression datasets. These datasets were selected based on the number of features, samples, and the presence of semantically meaningful feature names and descriptions. We provide further details on this selection of datasets and preprocessing in Appendix C.1."
Dataset Splits: Yes. Evidence: "We preprocess the dataset using a train-validation-test split ratio of [0.2, 0.4, 0.4]. The low training split is used to accentuate the difference in performance, as given sufficient training data, all methods perform comparably. For each run, we only vary the seed used for data splitting, such that for seed 0, we use train_test_split(seed=0)."
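The [0.2, 0.4, 0.4] split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the quote names scikit-learn's train_test_split, which is replaced here with a stdlib-only equivalent, and the two-stage cut is an assumption.

```python
import random

def split_indices(n, seed=0, ratios=(0.2, 0.4, 0.4)):
    """Shuffle indices with a fixed seed, then cut them into
    train/validation/test according to the given ratios.

    Mirrors the [0.2, 0.4, 0.4] ratio quoted above; only the seed is
    varied across runs, as the report states.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(1000, seed=0)
print(len(train), len(val), len(test))  # 200 400 400
```

Note the deliberately small training fraction (20%): per the quote, it is chosen to accentuate performance differences between methods.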
Hardware Specification: Yes. Evidence: "We run all experiments on an AMD EPYC 7V13 64-Core Processor."
Software Dependencies: Yes. Evidence: "For our experiments, we use gpt-35-turbo, version 0301, with default hyperparameters temperature = 0.7 and top_p = 0.95."
Experiment Setup: Yes. Evidence: "For our instantiation of LLEGO in Section 5, we use N = 25 and G = 25. We seed the algorithm with a population of trees generated by CART, where each tree is fitted on 25% of D_train. We use the same population to initialize GATree. In each iteration, we generate 25 crossover offspring and 25 mutation offspring... We use elitism selection to preserve the top 25 trees... To compute the desired fitness, we use α = 0.1... We use τ = 10 for diversity guidance. For each genetic operation, we use λ = 4 parent trees. For our experiments, we use gpt-35-turbo, version 0301, with default hyperparameters temperature = 0.7 and top_p = 0.95."

Hyperparameter tuning: "We use Optuna (Akiba et al., 2019) and the default Tree-Parzen Estimator for hyperparameter tuning (HPT) (Watanabe, 2023). For all baselines, we permit a wall-clock time of at most 10 minutes. This allows 50 iterations of HPT for CART and C4.5, and 10 iterations for the computationally more intensive DL8.5, GOSDT, and GATree. In each iteration of HPT, we evaluate the objective on the validation set, selecting the best configuration to evaluate on the test set." Search ranges are given in Table 5 of the paper.
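The generational loop implied by this setup (N = 25, G = 25, λ = 4 parents per operation, elitism keeping the top 25) can be sketched generically as below. This is a hedged sketch, not the authors' implementation: the LLM-driven crossover and mutation operators are passed in as opaque callables, and parent sampling is assumed to be uniform.

```python
import random

N, G, LAMBDA = 25, 25, 4  # population size, generations, parents per operation

def evolve(init_population, fitness, crossover, mutation, seed=0):
    """Generational loop with elitism, mirroring the setup quoted above:
    each generation produces N crossover and N mutation offspring from
    lambda-parent groups, then elitism keeps the top-N individuals by
    fitness. The semantically-aware LLM operators themselves are
    stand-in callables here."""
    rng = random.Random(seed)
    population = list(init_population)
    for _ in range(G):
        offspring = []
        for _ in range(N):
            parents = rng.sample(population, LAMBDA)  # uniform sampling (assumption)
            offspring.append(crossover(parents))
            offspring.append(mutation(parents))
        # Elitism selection: keep the best N of parents + offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:N]
    return population
```

As a toy usage, individuals can be plain numbers with identity fitness; elitism then guarantees the best initial individual is never lost, regardless of what the operators return.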