Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent

Authors: Ajaykrishna Karthikeyan, Naman Jain, Nagarajan Natarajan, Prateek Jain

TMLR 2022

Reproducibility assessment — each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental. "Using extensive validation on standard benchmarks, we demonstrate that our method provides best of both worlds, i.e., it is competitive to, and in some cases more accurate than methods designed specifically for the supervised settings; and in bandit settings, where most existing tree learning techniques are not applicable, our models are still accurate and significantly outperform the applicable SOTA methods. We provide algorithms for standard supervised learning (DGT) and bandit (DGT-Bandit) settings, and extensively evaluate them on multiple classification and regression benchmarks, in supervised as well as in bandit settings." Section 5 is dedicated to "Experiments" and contains numerous performance tables and figures comparing methods on various datasets.
Researcher Affiliation: Collaboration. Ajaykrishna Karthikeyan (Department of Computer Science, University of Illinois Urbana-Champaign); Naman Jain (Department of Electrical Engineering and Computer Science, University of California, Berkeley); Nagarajan Natarajan (Microsoft Research, India); Prateek Jain (Google Research, India).
Pseudocode: Yes. Algorithm 1: "DGT-Bandit: Learning decision trees in the bandit setting"; Algorithm 2: "BackProp"; Algorithm 3: "DGT: Learning decision trees in the supervised learning (batch) setting".
Open Source Code: Yes. "Source code is available at https://github.com/microsoft/dgt." (Footnote 2)
Open Datasets: Yes. "Datasets. We use all regression datasets from three recent works: 3 large tabular datasets from Popov et al. (2020), a chemical dataset from Lee and Jaakkola (2020), and 4 standard (UCI) datasets from Zharmagambetov and Carreira-Perpiñán (2020). For classification, we use 8 multi-class datasets from Carreira-Perpiñán and Tavallali (2018) and 2 large multi-class datasets from Féraud et al. (2016), on which we report accuracy, and 2 binary chemical datasets from Lee and Jaakkola (2020), where we follow their scheme and report AUC. Sizes, splits and other details for all the datasets are given in Appendix B." Table 8 (classification datasets description) and Table 9 (regression datasets description) list sources and links for the various datasets.
Dataset Splits: Yes. "Whenever possible we use the train-test split provided with the dataset. In cases where a separate validation split is not provided, we split the training dataset in 0.8 : 0.2 proportion to create a validation set. When no splits are provided, we create five random splits of the dataset and report the mean test scores across those splits, following Zharmagambetov and Carreira-Perpiñán (2020)." (Appendix B) Tables 8 and 9 also list "splits (tr:val:test)" for each dataset, e.g., "0.64 : 0.16 : 0.2" and "0.5 : 0.1 : 0.4".
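The splitting protocol quoted above can be sketched in a few lines. This is a minimal illustration of the described procedure (hold out 0.8 : 0.2 for validation; otherwise average over five random splits), not the authors' code; the function names and seeds are hypothetical.

```python
import random


def make_validation_split(train, val_fraction=0.2, seed=0):
    """Hold out a validation set from the training data (0.8 : 0.2),
    as described when no separate validation split is provided.
    Hypothetical helper; fraction and seed are illustrative."""
    idx = list(range(len(train)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(idx) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return [train[i] for i in train_idx], [train[i] for i in val_idx]


def mean_over_random_splits(data, evaluate, n_splits=5, test_fraction=0.2):
    """When the dataset ships with no splits at all: create five random
    train/test splits and report the mean test score across them."""
    scores = []
    for s in range(n_splits):
        idx = list(range(len(data)))
        random.Random(s).shuffle(idx)
        n_test = int(len(idx) * test_fraction)
        test = [data[i] for i in idx[:n_test]]
        train = [data[i] for i in idx[n_test:]]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

In practice `evaluate` would train a tree on `train` and score it on `test`; here it is just a stand-in callable.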
Hardware Specification: No. "In all the experiments, for a fixed tree height and a dataset, our DGT implementation takes 20 minutes for training (and 1 hour including hyper-parameter search over four GPUs)." The paper mentions using "four GPUs" but does not specify the GPU models, processor types, or any other hardware details.
Software Dependencies: Yes. "We implemented our algorithms in PyTorch v1.7 with CUDA v11." (Section 5) The paper also mentions using scikit-learn, the liblinear solver, and Vowpal Wabbit (VW) for other implementations (Appendix C.3.1 and C.3.2).
Experiment Setup: Yes. "For all tree methods we vary h ∈ {2, 4, 6, 8, 10}; (1) for LCN: optimizer, learning rate, dropout, and hidden layers of gϕ; (2) for TAO: λ1 regularization; (3) for DGT: momentum, regularization λ1, λ2, and overparameterization L; (4) for TEL: learning rate and regularization λ2." (Section 5) Appendix C.3.1 ("Offline/batch setting") gives further detail: "We train using the RMSProp optimizer with learning rate 1e-2 and a cosine scheduler with 3 restarts; gradient clipping 1e-2; optimizer momentum ∈ {0, 0.3} for regression and {0, 0.3, 0.8} for classification; regularization λ1, λ2 ∈ {5e-5, 1e-5, 5e-6, 1e-6, 0}; and overparameterization L ∈ {1, 3} with hidden space dimensions given in Table 12. We use a mini-batch size of 128 and train our model for 30-400 epochs based on the size of the dataset. Table 11 summarizes the choices for all the hyperparameters."
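The hyperparameter search quoted above amounts to a grid over the listed values. The grid values below are taken from the quoted setup (classification momentum; shared λ range for λ1 and λ2), while the `GRID` layout and `configs` helper are a hypothetical illustration of such a search driver, not the authors' code.

```python
from itertools import product

# Grid values quoted from the reported setup; per-dataset choices
# (e.g. regression momentum {0, 0.3}, hidden dimensions) vary.
GRID = {
    "height": [2, 4, 6, 8, 10],
    "momentum": [0.0, 0.3, 0.8],
    "lambda1": [5e-5, 1e-5, 5e-6, 1e-6, 0.0],
    "lambda2": [5e-5, 1e-5, 5e-6, 1e-6, 0.0],
    "overparam_L": [1, 3],
}


def configs(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Enumerating this grid gives 5 × 3 × 5 × 5 × 2 = 750 configurations per dataset, which is consistent with the reported ~1 hour search over four GPUs at ~20 minutes per training run only if many configurations are pruned or run in parallel; the paper does not state the exact search strategy.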