reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Tree-Based Models for Correlated Data

Authors: Assaf Rabinowicz, Saharon Rosset

JMLR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The superiority of our new approach over tree-based models that do not account for the correlation, and over previous work that integrated some aspects of our approach, is supported by simulation experiments and real data analyses.
Researcher Affiliation	Academia	Assaf Rabinowicz, Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv, Israel; Saharon Rosset, Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv, Israel
Pseudocode	Yes	Algorithm 1 REgression Tree for COrrelated Data (RETCO) ... Algorithm 2 A tree-based algorithm ... Algorithm 3 RE-EM Algorithm ... Algorithm 4 Mixed Random Forest (algorithm for a single tree)
Open Source Code	Yes	The code for the examples, as well as for the numerical part in Section 5, is written in Python and is available in https://github.com/AssafRab/RETCO.
Open Datasets	Yes	The data sets and the prediction problems are described in Section 5.2.1 ... Table 2: Data sets description. FIFA ... Kaggle, Crimes ... UCI, Korea Temperature ... UCI, California Housing ... Kaggle, Parkinson's Disease Telemonitoring ... UCI, Wages ... Brolgar package
Dataset Splits	Yes	FIFA ... the training set contains the observations of players from 20 clubs that were randomly sampled (548 observations), and the test set contains the observations of the other clubs (17,939 observations). Communities and Crime ... The training set contains 15 clusters that were randomly sampled (790 observations), where the test set contains the other clusters (1,204 observations). South Korea Temperature ... Measurements of the first two years were selected (575 observations) in order to predict the maximal temperature of the same set of days in the next years (2,325 observations). California Housing Prices ... 100 clusters are randomly sampled (279 observations) for the training set and the other clusters are used as the test set (12,124 observations). Parkinson’s Disease Telemonitoring ... five individuals were randomly sampled (742 observations) for the training set and the others were designated as the test set (5,179 observations). Wages ... 50 individuals were randomly sampled (331 observations) for the training set and the other are used as the test set (6,071 observations).
Hardware Specification	No	The paper does not explicitly state the specific hardware used to run its experiments, such as CPU/GPU models or cloud resources.
Software Dependencies	No	The code for the examples, as well as for the numerical part in Section 5, is written in Python and is available in https://github.com/AssafRab/RETCO. While Python is mentioned, specific version numbers for Python or any libraries used are not provided.
Experiment Setup	Yes	In both algorithms, the stopping rules are depth of tree smaller than four, and number of observations in the terminal node greater than two. ... Additional parameters that are relevant for RF are: the maximal tree depth is 10, the number of regression trees is T = 100, a random half-sample method is used for sampling the training set for each tree (i.e., the training sample size for each tree is 250 without duplicates), three covariates are randomly selected at each split, following the rule of thumb of selecting randomly round(log2(p)) potential covariates at each split. ... In all implementations, the minimum number of observations in a node is 3. For RF implementations, the number of covariates that were sampled at each split is round(log2(p)), the number of trees is 80, and the maximal tree depth is ten. For regression tree implementations, the maximal regression tree depth is five.
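
Two details in the table above lend themselves to a short illustration: the dataset splits are made at the cluster level (whole clubs, communities, or individuals go to either the training set or the test set), and the number of covariates tried at each split follows the round(log2(p)) rule of thumb. The Python sketch below shows one way these could be expressed using only the standard library. It is not taken from the RETCO repository; the function names `split_by_cluster` and `covariates_per_split`, the seed, and the toy cluster labels are all hypothetical.

```python
import math
import random
from collections import defaultdict


def split_by_cluster(cluster_ids, n_train_clusters, seed=0):
    """Split row indices so that every cluster lies entirely in train or in test.

    cluster_ids: one cluster label per observation (e.g. a club or an individual).
    n_train_clusters: how many clusters to sample at random for the training set.
    """
    rows_by_cluster = defaultdict(list)
    for row, cluster in enumerate(cluster_ids):
        rows_by_cluster[cluster].append(row)

    clusters = sorted(rows_by_cluster)  # sorted so a given seed is reproducible
    rng = random.Random(seed)
    train_clusters = set(rng.sample(clusters, n_train_clusters))

    train_idx, test_idx = [], []
    for cluster in clusters:
        target = train_idx if cluster in train_clusters else test_idx
        target.extend(rows_by_cluster[cluster])
    return train_idx, test_idx


def covariates_per_split(p):
    """Rule of thumb quoted above: round(log2(p)) candidate covariates per split."""
    return max(1, round(math.log2(p)))


# Toy example with made-up cluster labels (assumption: not real data).
toy_clusters = ["a"] * 3 + ["b"] * 2 + ["c"] * 4 + ["d"] * 1
train_idx, test_idx = split_by_cluster(toy_clusters, n_train_clusters=2, seed=1)
print("train rows:", train_idx)
print("test rows:", test_idx)
print("covariates per split for p = 8:", covariates_per_split(8))
```

Sampling clusters rather than individual rows keeps correlated observations from leaking between the training and test sets, which is the setting the paper's evaluation describes.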