Classifier Cascades and Trees for Minimizing Feature Evaluation Cost

Authors: Zhixiang (Eddie) Xu, Matt J. Kusner, Kilian Q. Weinberger, Minmin Chen, Olivier Chapelle

JMLR 2014

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we evaluate CSTC on a synthetic cost-sensitive learning task and compare it with competing algorithms on two large-scale, real-world benchmark problems. Additionally, we discuss the differences between our models for several learning settings. We provide further insight by analyzing the features extracted on these data sets and looking at how the CSTC tree partitions the input space. We judge the effect of the cost-sensitive regularization by looking at how removing terms and varying parameters affects CSTC on real-world data sets. We also present detailed results of CSTC on a cost-sensitive version of the MNIST data set, demonstrating that it extracts intelligent per-instance features.
Researcher Affiliation Collaboration Zhixiang (Eddie) Xu EMAIL, Matt J. Kusner EMAIL, Kilian Q. Weinberger EMAIL (Department of Computer Science, Washington University, 1 Brookings Drive, St. Louis, MO 63130, USA); Minmin Chen EMAIL, Olivier Chapelle EMAIL (Criteo, 411 High Street, Palo Alto, CA 94301, USA)
Pseudocode Yes Algorithm 1 CSTC global optimization
Open Source Code No The paper does not contain an explicit statement about the release of source code, nor does it provide a link to a code repository or supplementary materials containing code.
Open Datasets Yes To evaluate the performance of CSTC on real-world tasks, we test it on the Yahoo! Learning to Rank Challenge (LTR) data set. The set contains 19,944 queries and 473,134 documents. Each query-document pair xi consists of 519 features. We created a cost-sensitive binary MNIST data set by first extracting all images of digits 3 and 8. We resized them to four different resolutions: 4 x 4, 8 x 8, 16 x 16, and 28 x 28 (the original size), and concatenated all features, resulting in d = 1120 features.
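The multi-resolution feature construction quoted above can be sketched as follows. The nearest-neighbor downsampling and the function name are assumptions made for illustration; the paper only states the four resolutions and the total dimensionality d = 1120 (16 + 64 + 256 + 784):

```python
import numpy as np

def multires_features(img28):
    """Concatenate an MNIST digit at 4x4, 8x8, 16x16, and 28x28 resolutions.

    img28: (28, 28) array. The downsampling method (nearest-neighbor
    subsampling on a regular grid) is an assumption; the paper specifies
    only the four resolutions and the total of d = 1120 features.
    """
    feats = []
    for r in (4, 8, 16, 28):
        idx = np.arange(r) * 28 // r          # regular grid of r row/col indices
        small = img28[np.ix_(idx, idx)]       # r x r subsampled image
        feats.append(small.ravel())
    return np.concatenate(feats)              # 16 + 64 + 256 + 784 = 1120 features
```

Applying this to every extracted image of digits 3 and 8 would yield the d = 1120 representation described in the quote.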
Dataset Splits Yes We compute the validation error of the initialized CSTC tree at each node. Starting with the leaf nodes, we then prune away nodes that, upon removal, do not decrease the validation performance (in the case of ranking data, we can even use validation NDCG (Järvelin and Kekäläinen, 2002) as our pruning criterion). For CSTC we evaluate seven settings, λ = {1/2, 1, 2, 3, 4, 5, 6}. In the case of stage-wise regression, which is not cost-sensitive, the curve is simply a function of boosting iterations. We include CSTC with and without fine-tuning. To select hyperparameters C (SVM cost) and γ (kernel width) we used 100 rounds of Bayesian optimization on a validation set (we found C = 753.1768, γ = 0.0198). An RBF-SVM trained with these hyperparameters achieves a test error of 0.005 for cost c = 1120.
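The reference RBF-SVM with the reported hyperparameters can be reproduced along these lines. The use of scikit-learn and the stand-in data are assumptions; the paper does not name a library, and its experiments use the d = 1120 cost-sensitive MNIST features:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hyperparameters reported in the quote, found via Bayesian optimization.
C, gamma = 753.1768, 0.0198

# Stand-in binary classification data (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF-kernel SVM with the reported cost and kernel width.
svm = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```

On the real data set the paper reports a test error of 0.005 at full feature cost c = 1120; the synthetic data here is only a placeholder.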
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies No The paper mentions techniques like "gradient boosted regression trees" and "support vector machine (SVM) with a radial basis function (RBF) kernel" but does not specify the software libraries or their version numbers used for implementation.
Experiment Setup Yes To introduce non-linearity, we transform the input features into a non-linear feature space x -> φ(x) with the boosting trick (see Section 5), using T = 3000 iterations of gradient boosting and CART trees of maximum depth 4. Unless otherwise stated, we determine the CSTC depth by validation performance (with a maximum depth of 10). For CSTC we evaluate seven settings, λ = {1/2, 1, 2, 3, 4, 5, 6}. We set the maximum number of Cronus nodes to 10, and set all other parameters (e.g., keep ratio, discount, early-stopping) based on a validation set. To select hyperparameters C (SVM cost) and γ (kernel width) we used 100 rounds of Bayesian optimization on a validation set (we found C = 753.1768, γ = 0.0198).
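The boosting trick quoted above can be sketched as follows: each boosted CART tree's prediction becomes one coordinate of the non-linear feature map φ(x). The scikit-learn implementation, the stand-in data, and the reduced T are assumptions for a fast illustration; the paper uses T = 3000 iterations:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression

# Stand-in regression data (assumption for illustration).
X, y = make_regression(n_samples=300, n_features=20, random_state=0)

# T = 3000 in the paper; a small T keeps this sketch fast.
T, max_depth = 50, 4
gbrt = GradientBoostingRegressor(n_estimators=T, max_depth=max_depth,
                                 random_state=0).fit(X, y)

# phi(x): one feature per boosting iteration, i.e., each CART tree's
# prediction becomes one coordinate of the non-linear feature map.
phi = np.column_stack([est[0].predict(X) for est in gbrt.estimators_])
print(phi.shape)  # (n_samples, T)
```

A linear (cost-sensitive) model trained on phi then behaves like a non-linear model in the original space, which is the point of the boosting trick.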