Histogram Transform Ensembles for Large-scale Regression
Authors: Hanyuan Hang, Zhouchen Lin, Xiaoyu Liu, Hongwei Wen
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, we validate the above theoretical results with extensive numerical experiments. On the one hand, simulations are conducted to elucidate that ensemble NHT outperforms single NHT. On the other hand, the effects of bin sizes on the accuracy of both NHT and KHT are also in accord with the theoretical analysis. Last but not least, in the real-data experiments, comparisons between the ensemble KHT, equipped with adaptive histogram transforms, and other state-of-the-art large-scale regression estimators verify the effectiveness and precision of the proposed algorithm. Numerical experiments are conducted in Section 4 to verify our theoretical results and to further witness the effectiveness and efficiency of our algorithm. |
| Researcher Affiliation | Academia | Hanyuan Hang EMAIL Department of Applied Mathematics, University of Twente, 7522 NB Enschede, The Netherlands; Zhouchen Lin EMAIL Key Lab. of Machine Perception (MoE), School of EECS, Peking University, 100871 Beijing, China; Xiaoyu Liu EMAIL, Hongwei Wen EMAIL Institute of Statistics and Big Data, Renmin University of China, 100872 Beijing, China |
| Pseudocode | Yes | Algorithm 1: Histogram Transform Ensembles (HTE) Algorithm 2: Adaptive Splitting Algorithm 3: Adaptive Kernel Histogram Transform Ensembles (Adaptive KHTE) |
| Open Source Code | No | The text does not contain a clear statement that the authors are releasing their code for the methodology described in the paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We carry out experiments based on a real data set PTS, the Physicochemical Properties of Protein Tertiary Structure Data Set, available on UCI. It contains totally 45,730 samples of 9 dimensions... AEP: The Appliances energy prediction (AEP) data set, available on UCI... HPP: This data set House-Price-8H prototask (HPP) is originally from DELVE dataset... CAD: This spatial data can be traced back to Pace and Barry (1997)... EGS: The Electrical Grid Stability Simulated Data (EGS) Data Set, belonging to the field of physics, is available on UCI... SCD: The Superconducting Material Database (SCD), available on UCI... ONP: The Online News Popularity Data Set (ONP), available on UCI... MSD: The Year Prediction MSD Data Set (MSD) is available on UCI. |
| Dataset Splits | Yes | It contains totally 45,730 samples of 9 dimensions, with 70% samples randomly selected as the training set, and the remaining 30% as the testing set. Whereas for the MSD data set, we adopt the following train/test split: the first 463,715 examples are treated as the training set and the last 51,630 are treated as the testing set. |
| Hardware Specification | No | Resources supporting this work were provided by High-performance Computing Platform of Renmin University of China. The paper mentions a high-performance computing platform but does not specify any particular hardware components like CPU or GPU models. |
| Software Dependencies | No | In this paper, we implement the random forest regressor through the package sklearn.ensemble for python, and more details on the parameter selection of RF can be found in Section 4.6.2. The paper mentions a software package and programming language but does not provide specific version numbers for any key software components used in their methodology. |
| Experiment Setup | Yes | We set the pair (T, m) to be (5, 1000) and (20, 1000) except for the MSD data set, where we select (5, 2000) and (20, 3000), for the trade-off between accuracy and running time. We adopt the grid search method for other hyper-parameter selections. To be specific, for the data sets HPP, CAD, PTS, AEP, EGS, SCD and ONP, the regularization parameter λ and the kernel bin width γ are selected from 7 and 8 values, from 10^-3 to 10^3 and from 0.05 to 10, respectively, spaced evenly on a log scale with a geometric progression. For the MSD data set, we choose λ in {0.01, 1, 100}, and γ in {0.001, 0.1, 10}. |
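The grid described in the Experiment Setup row can be reconstructed explicitly. Below is a minimal sketch (not the authors' code) assuming NumPy: λ takes 7 values from 10^-3 to 10^3 and γ takes 8 values from 0.05 to 10, both spaced as a geometric progression, with the coarser MSD grids listed as plain lists.

```python
import numpy as np

# Regularization parameter lambda: 7 values from 1e-3 to 1e3,
# evenly spaced on a log scale (i.e., a geometric progression).
lambdas = np.logspace(-3, 3, num=7)  # 1e-3, 1e-2, ..., 1e3

# Kernel bin width gamma: 8 values from 0.05 to 10,
# also a geometric progression.
gammas = np.geomspace(0.05, 10.0, num=8)

# Coarser grids used for the large MSD data set, as stated in the paper.
lambdas_msd = [0.01, 1, 100]
gammas_msd = [0.001, 0.1, 10]

# A grid search then evaluates every (lambda, gamma) pair.
grid = [(lam, gam) for lam in lambdas for gam in gammas]
print(len(grid))  # 7 * 8 = 56 candidate pairs
```

The 56-pair grid would typically be scored by cross-validated error on the training split; the exact selection criterion is not detailed in the excerpt above.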