Random Rotation Ensembles

Authors: Rico Blaser, Piotr Fryzlewicz

JMLR 2016

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce a method that is simple to implement yet general and effective in improving ensemble diversity with only modest impact on the accuracy of the individual base learners. By randomly rotating the feature space prior to inducing the base learners, we achieve favorable aggregate predictions on standard data sets compared to state of the art ensemble methods, most notably for tree-based ensembles, which are particularly sensitive to rotation. Keywords: feature rotation, ensemble diversity, smooth decision boundary
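The quoted abstract describes the core idea: rotate the feature space, then induce an axis-aligned base learner in the rotated coordinates. The toy sketch below (not the paper's code; 2-D only, decision stumps as stand-in base learners, all names invented) illustrates why rotation helps axis-aligned learners on an oblique decision boundary:

```python
import math, random

def rotate2d(x, theta):
    """Apply a 2-D rotation by angle theta to the point x = (x1, x2)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def fit_stump(X, y):
    """Axis-aligned decision stump: pick the (feature, threshold, sign)
    with the fewest training errors."""
    best = None
    for j in (0, 1):
        for t in sorted(set(x[j] for x in X)):
            for sign in (1, -1):
                err = sum((sign if x[j] >= t else -sign) != yi
                          for x, yi in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda x: sign if x[j] >= t else -sign

def rr_ensemble(X, y, n_learners=25, seed=0):
    """Train each stump in its own randomly rotated coordinate system."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_learners):
        theta = rng.uniform(0, 2 * math.pi)   # fresh rotation per learner
        Xr = [rotate2d(x, theta) for x in X]
        members.append((theta, fit_stump(Xr, y)))
    def predict(x):
        votes = sum(stump(rotate2d(x, th)) for th, stump in members)
        return 1 if votes >= 0 else -1
    return predict

# Diagonal boundary sign(x1 + x2): hard for one axis-aligned stump,
# easy for a vote over stumps trained in rotated coordinates.
pts = [(a / 5.0, b / 5.0) for a in range(-5, 6) for b in range(-5, 6)]
data = [p for p in pts if abs(p[0] + p[1]) > 0.2]
labels = [1 if p[0] + p[1] > 0 else -1 for p in data]
clf = rr_ensemble(data, labels)
acc = sum(clf(p) == l for p, l in zip(data, labels)) / len(data)
```

A single axis-aligned stump cannot represent the diagonal boundary, but a majority vote over stumps fitted in random rotations of the plane approximates it closely, which is the diversity mechanism the abstract describes.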
Researcher Affiliation | Academia | Rico Blaser (EMAIL), Piotr Fryzlewicz (EMAIL), Department of Statistics, London School of Economics, Houghton Street, London WC2A 2AE, UK
Pseudocode | Yes | The necessary modifications are illustrated in pseudo code in Listing 1 below. All methods tested use classification or regression trees that divide the predictor space into disjoint regions Gj, where 1 ≤ j ≤ J, with J denoting the total number of terminal nodes of the tree. Extending the notation in Hastie et al. (2009), we represent a tree as
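The notation quoted above treats a fitted tree as a sum of region indicators: the prediction is the constant cj attached to whichever region Gj contains x. A minimal sketch of that representation (the regions and constants here are made up for illustration):

```python
# Each region G_j is an axis-aligned box (per-feature lo/hi bounds) with
# a constant prediction c_j; together the regions partition the space.
regions = [
    ({"lo": (0.0, 0.0), "hi": (0.5, 1.0)}, 0.2),   # G_1, c_1
    ({"lo": (0.5, 0.0), "hi": (1.0, 0.4)}, 0.7),   # G_2, c_2
    ({"lo": (0.5, 0.4), "hi": (1.0, 1.0)}, 1.0),   # G_3, c_3
]

def tree_predict(x):
    """T(x) = sum_j c_j * 1{x in G_j}: because the regions are
    disjoint, exactly one indicator fires."""
    for box, c in regions:
        if all(lo <= xi < hi for lo, xi, hi in zip(box["lo"], x, box["hi"])):
            return c
    raise ValueError("x lies outside the partitioned space")
```

Rotating the feature space changes which axis-aligned boxes the induction algorithm can carve out, which is why tree ensembles are particularly sensitive to rotation.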
Open Source Code | Yes | For this reason, we provide random rotation code in C/C++ and R in Appendix A, which can be used as a basis for enhancing existing software packages.
Open Datasets | Yes | For our comparative study of random rotation, we selected UCI data sets (Bache and Lichman, 2013) that are commonly used in the machine learning literature in order to make the results easier to interpret and compare. Table 5 in Appendix C summarizes the data sets, including relevant dimensional information.
Dataset Splits | Yes | For each experiment we performed a random 70-30 split of the data: 70% served as training data and the remaining 30% as testing data. The split was performed uniformly at random, but enforcing the constraint that at least one observation of each category level had to be present in the training data for categorical variables. This constraint was necessary to avoid situations where the testing data contained category levels that were absent from the training set. Experiments were repeated 100 times (with different random splits) and the average performance was recorded.
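The paper does not state how the category-level constraint was enforced; rejection sampling (redraw the split until the constraint holds) is one straightforward reading. A hypothetical sketch under that assumption:

```python
import random

def constrained_split(rows, cat_cols, train_frac=0.7, seed=0):
    """Random train/test split, redrawn until every level of each
    categorical column appears at least once in the training part.
    NOTE: rejection sampling is an assumption, not the paper's stated
    mechanism."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(rows)))
    full_levels = {c: {r[c] for r in rows} for c in cat_cols}
    while True:
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        train = [rows[i] for i in idx[:n_train]]
        if all({r[c] for r in train} == full_levels[c] for c in cat_cols):
            test = [rows[i] for i in idx[n_train:]]
            return train, test

# Toy data: column 0 is categorical, with a rare level "z" that an
# unconstrained split could easily leave out of the training part.
rows = [("x", i) for i in range(7)] + [("y", 7), ("y", 8), ("z", 9)]
train, test = constrained_split(rows, cat_cols=[0])
```

With a rare level, an unconstrained 70-30 split would place it entirely in the test set roughly 30% of the time, producing exactly the unseen-level problem the quoted text describes.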
Hardware Specification | Yes | The C++ code takes less than 0.5 seconds on a single core of an Intel Xeon E5-2690 CPU to generate a 1000x1000 random rotation matrix.
Software Dependencies | Yes | It uses the Eigen template library (Guennebaud et al., 2010) and a Mersenne Twister (Matsumoto and Nishimura, 1998) pseudorandom number generator.
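The appendix implementation uses Eigen's linear algebra for this step; the standard construction of a uniformly random rotation (orthonormalise a Gaussian matrix, then fix the determinant sign) can be sketched without any dependencies. This is a dependency-free sketch of the technique, not the paper's code; conveniently, CPython's random.Random is itself a Mersenne Twister:

```python
import math, random

def det(m):
    """Determinant by Laplace expansion (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def haar_rotation(d, seed=0):
    """Draw a d x d rotation matrix uniformly (Haar measure):
    orthonormalise the columns of a Gaussian random matrix via
    classical Gram-Schmidt (a thin QR factorisation), then flip one
    column if the determinant is -1, so that det(R) = +1."""
    rng = random.Random(seed)        # Mersenne Twister under the hood
    cols = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in cols:               # remove components along earlier columns
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        cols.append([a / norm for a in v])
    if det(cols) < 0:                # det(A) = det(A^T), so cols-as-rows works
        cols[0] = [-a for a in cols[0]]
    return cols                      # R returned as a list of orthonormal columns
```

For the 1000x1000 matrices mentioned above one would use a proper QR routine (as the Eigen-based appendix code does) rather than classical Gram-Schmidt, which is numerically weaker and slower at that scale.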
Experiment Setup | Yes | In all cases we used default parameters for the tree induction algorithms, except that we built 5000 trees for each ensemble in the hope of achieving full convergence. To evaluate the performance of random rotations, we ranked each method for each data set and computed the average rank across all data sets.
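The average-rank aggregation described above can be sketched as follows (averaging tied ranks is an assumption; the paper does not specify its tie handling, and all names here are invented):

```python
def average_ranks(error_table):
    """error_table[dataset][method] -> test error.  Rank the methods
    within each data set (1 = lowest error; ties receive the mean of
    the tied ranks) and average the ranks across data sets."""
    methods = list(next(iter(error_table.values())).keys())
    totals = {m: 0.0 for m in methods}
    for errs in error_table.values():
        ordered = sorted(methods, key=lambda m: errs[m])
        i = 0
        while i < len(ordered):
            j = i                      # extend j over a run of tied errors
            while j + 1 < len(ordered) and errs[ordered[j + 1]] == errs[ordered[i]]:
                j += 1
            mean_rank = (i + 1 + j + 1) / 2.0   # ranks are 1-based
            for k in range(i, j + 1):
                totals[ordered[k]] += mean_rank
            i = j + 1
    n = len(error_table)
    return {m: totals[m] / n for m in methods}

# Hypothetical error table for three methods on three data sets.
table = {
    "d1": {"A": 0.10, "B": 0.20, "C": 0.30},
    "d2": {"A": 0.20, "B": 0.10, "C": 0.30},
    "d3": {"A": 0.10, "B": 0.10, "C": 0.20},   # A and B tied on d3
}
avg = average_ranks(table)
```

Averaging ranks rather than raw errors keeps data sets with very different error scales from dominating the comparison.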