Histogram Transform Ensembles for Large-scale Regression
Authors: Hanyuan Hang, Zhouchen Lin, Xiaoyu Liu, Hongwei Wen
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, we validate the above theoretical results with extensive numerical experiments. On the one hand, simulations are conducted to elucidate that ensemble NHT outperforms single NHT. On the other hand, the effects of bin sizes on the accuracy of both NHT and KHT are also in accord with the theoretical analysis. Last but not least, in the real-data experiments, comparisons between the ensemble KHT, equipped with adaptive histogram transforms, and other state-of-the-art large-scale regression estimators verify the effectiveness and precision of the proposed algorithm. Numerical experiments are conducted in Section 4 to verify our theoretical results and to further witness the effectiveness and efficiency of our algorithm. |
| Researcher Affiliation | Academia | Hanyuan Hang EMAIL Department of Applied Mathematics, University of Twente, 7522 NB Enschede, The Netherlands; Zhouchen Lin EMAIL Key Lab. of Machine Perception (MoE), School of EECS, Peking University, 100871 Beijing, China; Xiaoyu Liu EMAIL, Hongwei Wen EMAIL Institute of Statistics and Big Data, Renmin University of China, 100872 Beijing, China |
| Pseudocode | Yes | Algorithm 1: Histogram Transform Ensembles (HTE) Algorithm 2: Adaptive Splitting Algorithm 3: Adaptive Kernel Histogram Transform Ensembles (Adaptive KHTE) |
| Open Source Code | No | The text does not contain a clear statement that the authors are releasing their code for the methodology described in the paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We carry out experiments based on a real data set PTS, the Physicochemical Properties of Protein Tertiary Structure Data Set, available on UCI. It contains totally 45,730 samples of 9 dimensions... AEP: The Appliances energy prediction (AEP) data set, available on UCI... HPP: This data set House-Price-8H prototask (HPP) is originally from DELVE dataset... CAD: This spatial data can be traced back to Pace and Barry (1997)... EGS: The Electrical Grid Stability Simulated Data (EGS) Data Set, belonging to the field of physics, is available on UCI... SCD: The Superconducting Material Database (SCD), available on UCI... ONP: The Online News Popularity Data Set (ONP), available on UCI... MSD: The Year Prediction MSD Data Set (MSD) is available on UCI. |
| Dataset Splits | Yes | It contains totally 45,730 samples of 9 dimensions, with 70% samples randomly selected as the training set, and the remaining 30% as the testing set. Whereas for the MSD data set, we adopt the following train/test split: the first 463,715 examples are treated as the training set and the last 51,630 are treated as the testing set. |
| Hardware Specification | No | Resources supporting this work were provided by High-performance Computing Platform of Renmin University of China. The paper mentions a high-performance computing platform but does not specify any particular hardware components like CPU or GPU models. |
| Software Dependencies | No | In this paper, we implement the random forest regressor through the package sklearn.ensemble for python, and more details on the parameter selection of RF can be found in Section 4.6.2. The paper mentions a software package and programming language but does not provide specific version numbers for any key software components used in their methodology. |
| Experiment Setup | Yes | We set the pair (T, m) to be (5, 1000) and (20, 1000) except for the MSD data set, where we select (5, 2000) and (20, 3000), for the trade-off between accuracy and running time. We adopt the grid search method for other hyper-parameter selections. To be specific, for the data sets HPP, CAD, PTS, AEP, EGS, SCD and ONP, the regularization parameter λ and the kernel bin width γ are selected from 7 and 8 values, from 10^-3 to 10^3 and from 0.05 to 10, respectively, spaced evenly on a log scale with a geometric progression. For the MSD data set, we choose λ in {0.01, 1, 100}, and γ in {0.001, 0.1, 10}. |
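The grid described in the Experiment Setup row can be reconstructed explicitly. Below is a minimal sketch (not the authors' code) assuming NumPy: λ takes 7 values from 10^-3 to 10^3 and γ takes 8 values from 0.05 to 10, both spaced as a geometric progression, with the coarser MSD grids listed as plain lists.

```python
import numpy as np

# Regularization parameter lambda: 7 values from 1e-3 to 1e3,
# evenly spaced on a log scale (i.e., a geometric progression).
lambdas = np.logspace(-3, 3, num=7)  # 1e-3, 1e-2, ..., 1e3

# Kernel bin width gamma: 8 values from 0.05 to 10,
# also a geometric progression.
gammas = np.geomspace(0.05, 10.0, num=8)

# Coarser grids used for the large MSD data set, as stated in the paper.
lambdas_msd = [0.01, 1, 100]
gammas_msd = [0.001, 0.1, 10]

# A grid search then evaluates every (lambda, gamma) pair.
grid = [(lam, gam) for lam in lambdas for gam in gammas]
print(len(grid))  # 7 * 8 = 56 candidate pairs
```

The 56-pair grid would typically be scored by cross-validated error on the training split; the exact selection criterion is not detailed in the excerpt above.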