A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees
Authors: Haoran Zhu, Pavankumar Murali, Dzung Phan, Lam Nguyen, Jayant Kalagnanam
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using 36 data-sets from the University of California Irvine Machine Learning Repository, we demonstrate that our formulation outperforms its counterparts in the literature by an average of about 10% in terms of mean out-of-sample testing accuracy across the data-sets. |
| Researcher Affiliation | Industry | Haoran Zhu, Pavankumar Murali, Dzung T. Phan, Lam M. Nguyen, Jayant R. Kalagnanam, IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA (author email addresses redacted) |
| Pseudocode | Yes | Algorithm 1 LP-based data-selection in each cluster N |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | For benchmarking, we use data-sets from the UCI Machine Learning Repository [9]. |
| Dataset Splits | Yes | We used the same training-testing split as in [4]: 75% of the entire data-set as the training set and the remaining 25% as the testing set, with 5-fold cross-validation. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory, or cloud instance types). |
| Software Dependencies | Yes | We implemented all these MIP approaches in Python 3.6 and solved them using CPLEX 12.9.0 [8]. We invoked the DecisionTreeClassifier implementation from scikit-learn to train a decision tree using CART, with default parameter settings. (A CART baseline sketch follows the table.) |
| Experiment Setup | Yes | The time limit was set to 15 or 30 minutes for medium-sized data-sets, and for larger data-sets it was increased to up to 4 hours so the solver could make sufficient progress. For all methods, the maximum tree depth was set to the same value as our SVM1-ODT. For solving S1O and OCT-H, we use the decision tree trained with CART as a warm-start solution for CPLEX, as in [4, 24]. In Algorithm 1, T(λ) is the transformation used in the proof of Theorem 4 to construct a feasible solution for (6), and β1, β2 are pre-specified threshold parameters. (A CPLEX setup sketch follows the table.) |
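
The Dataset Splits row reports a 75%/25% training-testing split with 5-fold cross-validation. The snippet below is a minimal sketch of that partitioning using scikit-learn; the data-set (`load_iris`), random seeds, and variable names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the reported data partitioning: 75% training / 25% testing,
# with 5-fold cross-validation on the training portion (assumed usage).
from sklearn.model_selection import train_test_split, KFold
from sklearn.datasets import load_iris  # stand-in for a UCI data-set

X, y = load_iris(return_X_y=True)

# 75/25 training-testing split, as described in the paper (following [4]).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 5-fold cross-validation over the training set, e.g. for model selection.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, val_idx) in enumerate(kf.split(X_train)):
    X_tr, X_val = X_train[tr_idx], X_train[val_idx]
    y_tr, y_val = y_train[tr_idx], y_train[val_idx]
    # ... train and validate a candidate model on this fold ...
```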
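The Software Dependencies row states that the CART baseline was trained with scikit-learn's DecisionTreeClassifier using default parameters. The following is a minimal, self-contained sketch of such a baseline; the data-set and split are again stand-ins rather than the authors' experimental pipeline.

```python
# Minimal sketch of the CART baseline: scikit-learn's DecisionTreeClassifier
# with default parameter settings, evaluated on a held-out 25% testing set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # stand-in for a UCI data-set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

cart = DecisionTreeClassifier()  # default parameters, as reported
cart.fit(X_train, y_train)
print("CART out-of-sample accuracy:",
      accuracy_score(y_test, cart.predict(X_test)))
```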
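The Experiment Setup row mentions per-instance time limits and a CART-derived warm start passed to CPLEX. The sketch below shows how such settings are typically supplied through the CPLEX Python API; the model file name, variable names, and warm-start values are hypothetical placeholders, and the mapping from a CART tree to MIP variables is specific to the paper's formulation and not reproduced here.

```python
import cplex

# Load a hypothetical exported MIP model (placeholder file name).
cpx = cplex.Cplex("svm1_odt.lp")

# Time limit in seconds: 15/30 minutes for medium data-sets,
# up to 4 hours for larger ones, per the reported setup.
cpx.parameters.timelimit.set(30 * 60)

# Supply a warm start, e.g. a feasible solution encoding a CART tree of the
# same depth. The variable names and values below are illustrative only.
warm_names = ["a_1_1", "b_1"]
warm_values = [1.0, 0.5]
cpx.MIP_starts.add(
    cplex.SparsePair(ind=warm_names, val=warm_values),
    cpx.MIP_starts.effort_level.auto,
)

cpx.solve()
```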