Confidence Decision Trees via Online and Active Learning for Streaming Data

Authors: Rocco De Rosa, Nicolò Cesa-Bianchi

JAIR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by state-of-the-art techniques."
Researcher Affiliation | Academia | Rocco De Rosa (EMAIL), Dipartimento di Informatica, Università degli Studi di Milano, 20135 Milano, Italy; Nicolò Cesa-Bianchi (EMAIL), Dipartimento di Informatica, Università degli Studi di Milano, 20135 Milano, Italy
Pseudocode | Yes | Algorithm 1 (C-Tree; Input: threshold τ > 0 ...); Algorithm 2 (Query Strategy); Algorithm 3 (Online Stream Validation Protocol); Algorithm 4 (Random Strategy); Algorithm 5 (Variable Uncertainty Strategy); Algorithm 6 (Split Strategy); Algorithm 7 (Rand CBT; Input: tree T, total number of leaves num-leaves, number of attributes d, leaf class-conditional probability q; Output: complete binary tree T)
Open Source Code | No | The paper mentions using and modifying the H-Tree algorithm implemented in MOA, but does not state that the modified code is open source or provide a link to it.
Open Datasets | Yes | A9A, COD-RNA, and COVERTYPE are from the LIBSVM binary classification repository; AIRLINES and ELECTRICITY are from the MOA collection.
Dataset Splits | No | The paper describes experiments in a streaming setting using "Interleaved Test-Then-Train validation in MOA" and creating "ten different streams from each dataset ... by taking a random permutation of the examples in it." This implies continuous, sequential processing of the data rather than fixed train/test/validation splits.
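The Interleaved Test-Then-Train (prequential) protocol mentioned above can be sketched as follows. This is a minimal illustration, not the paper's C-Tree: `MajorityClass` is a toy stand-in learner, and all names here are hypothetical.

```python
import random
from collections import Counter

class MajorityClass:
    """Toy online learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def update(self, x, y):
        self.counts[y] += 1

def interleaved_test_then_train(data, make_model, n_permutations=10, seed=0):
    """Prequential evaluation as in MOA: each arriving example is first used
    to test the model, then to train it. Following the paper's setup, the
    dataset is turned into n_permutations streams by random shuffling, and
    online accuracy is averaged over the streams."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_permutations):
        stream = list(data)
        rng.shuffle(stream)        # one random permutation = one stream
        model = make_model()       # fresh model for each stream
        correct = 0
        for x, y in stream:
            correct += model.predict(x) == y   # test first ...
            model.update(x, y)                 # ... then train
        accuracies.append(correct / len(stream))
    return sum(accuracies) / len(accuracies)
```

Because every example is tested before it is trained on, the protocol needs no held-out split, which is why the "Dataset Splits" variable is marked No.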
Hardware Specification | No | The paper does not specify the hardware used for the experiments.
Software Dependencies | No | The paper mentions that the Hoeffding Tree (H-Tree) algorithm is implemented in MOA and that the authors modified the H-Tree code in MOA, but it gives no version number for MOA or any other software dependency.
Experiment Setup | Yes | The grace period parameter was set to 100. In contrast to the typical experimental settings in the literature, we did not consider the tie-break parameter because in the experiments we observed that it caused the majority of the splits. Based on Theorem 4 and Remark 1 (which leads to the choice δ_t = 1/t), we used the following version of our confidence bounds ε_KM and ε_Gini (the bound for ε_ent contains an extra ln m factor): ε̃_KM = ε̃_Gini = c √((1/m) ln(m² h_t² d)) (7), where the parameter c is used to control the number of splits. ... The parameters δ in H-Tree and Corr H-Tree, and c in C-Tree ... were individually tuned on each dataset using a grid of 200 values, hence plots show the online performance of each algorithm when it is close to optimally tuned. The ranges for the parameters are c ∈ (0, 2) and δ ∈ (0, 1). The optimal values of c and δ found in the tuning phase are typically around 5·10⁻³ and 10⁻², respectively. ... As explained by Zliobaite et al. (2014), the parameter s can be safely set to a default value of 0.01. We performed all the experiments with this setting. ... In the experiments we set the parameter ν = 0.2.
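A small numerical illustration of the tuned confidence bound quoted above. The exact grouping of terms under the logarithm, ln(m² h_t² d), is an assumption reconstructed from the flattened notation in the source; the helper name and sample values are illustrative only.

```python
import math

def confidence_bound(m, h_t, d, c):
    """Reconstructed form of Eq. (7): eps = c * sqrt(ln(m^2 * h_t^2 * d) / m).

    m   -- number of examples observed at the leaf
    h_t -- (assumed) tree-depth/time-dependent factor from delta_t = 1/t
    d   -- number of attributes
    c   -- tuning constant controlling the number of splits, c in (0, 2)
    """
    return c * math.sqrt(math.log(m**2 * h_t**2 * d) / m)
```

With c near its typically optimal value of 5·10⁻³, the bound shrinks as more examples m reach a leaf, so splits become easier to certify as the stream progresses.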