Near-Optimal Decision Trees in a SPLIT Second
Authors: Varun Babbar, Hayden McTavish, Cynthia Rudin, Margo Seltzer
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show dramatic improvements in runtime compared to state of the art algorithms, with negligible loss in accuracy or sparsity. For all experiments below we set the depth budget of our algorithms to 5. The lookahead depth for Algorithm 2 is set to 2 since, from Corollary 6.2, this produces the lowest runtime for the chosen depth budget. We defer more details of our experimental setup and datasets to Appendix A.7.4. Appendix A.3 has additional evaluations of our methods. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Duke University, Durham, USA 2Department of Computer Science, University of British Columbia, Vancouver, Canada. Correspondence to: Varun <EMAIL>, Hayden <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: get_bounds(D, dl, d, d′, N) → (lb, ub)... Algorithm 2: SPLIT(ℓ, D, λ, dl, d, p)... Algorithm 3: LicketySPLIT(ℓ, D, λ, d)... Algorithm 4: Greedy(D, d, λ) → (tgreedy, lb)... Algorithm 5: RESPLIT(ℓ, D, λ, dl, d)... |
| Open Source Code | Yes | Code for our algorithms and experiments can be found at https://github.com/VarunBabbar/SPLIT-ICML. |
| Open Datasets | Yes | The Home Equity Line of Credit (HELOC) (FICO, 2018) dataset used for the Explainable ML Challenge... The Bike dataset (Fanaee-T & Gama, 2013)... FICO. Home equity line of credit (HELOC) dataset. https://community.fico.com/s/explainable-machine-learning-challenge, 2018. FICO Explainable Machine Learning Challenge. Fanaee-T, H. and Gama, J. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pp. 1–15, 2013. |
| Dataset Splits | Yes | All datasets were evaluated on three random 80-20 train-test splits of the data, with the average and standard error reported. |
| Hardware Specification | Yes | All experiments were performed on an institutional computing cluster. This was a single Intel(R) Xeon(R) Gold 6226 machine with a 2.70GHz CPU. It has 300GB RAM and 24 cores. |
| Software Dependencies | No | The paper mentions using the standard 'scikit-learn Decision Tree Classifier class' and building upon the 'GOSDT' and 'Tree FARMS' codebase. However, specific version numbers for these or any other key software dependencies (like Python, PyTorch, etc.) are not provided in the text. |
| Experiment Setup | Yes | For all experiments below we set the depth budget of our algorithms to 5. The lookahead depth for Algorithm 2 is set to 2... We vary the sparsity penalty, λ, which is a common input to all algorithms in this figure, and compute the regularized test objective from Equation 1 for each value of λ. We set a timeout limit of 1200 seconds for GOSDT, after which it gives the best solution found so far. |
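The evaluation protocol quoted above (three random 80-20 train-test splits, each scored by a regularized objective combining test error with a sparsity penalty λ times tree size) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact form of Equation 1 is in the paper, and the `evaluate` callback is a hypothetical stand-in for training and scoring one of the compared algorithms.

```python
import random
import statistics

def regularized_objective(n_errors: int, n_samples: int,
                          n_leaves: int, lam: float) -> float:
    """Sketch of a regularized objective in the style of Equation 1:
    misclassification rate plus a sparsity penalty of lam per leaf.
    (Assumption: the paper's Equation 1 may differ in exact form.)"""
    return n_errors / n_samples + lam * n_leaves

def three_split_average(records, evaluate, seed=0):
    """Average a metric over three random 80-20 train-test splits,
    mirroring the protocol quoted in the table. `evaluate` is a
    hypothetical callback: (train, test) -> metric."""
    rng = random.Random(seed)
    scores = []
    for _ in range(3):
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = int(0.8 * len(shuffled))  # 80% train, 20% test
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / (len(scores) ** 0.5)
    return mean, std_err
```

For example, a tree with 10 errors on 100 test samples and 8 leaves at λ = 0.01 scores 0.10 + 0.08 = 0.18; sweeping λ, as the paper does, trades test error against sparsity.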