Tree-Based Models for Correlated Data
Authors: Assaf Rabinowicz, Saharon Rosset
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The superiority of our new approach over tree-based models that do not account for the correlation, and over previous work that integrated some aspects of our approach, is supported by simulation experiments and real data analyses. |
| Researcher Affiliation | Academia | Assaf Rabinowicz EMAIL Department of Statistics and Operations Research Tel Aviv University Tel Aviv, Israel Saharon Rosset EMAIL Department of Statistics and Operations Research Tel Aviv University Tel Aviv, Israel |
| Pseudocode | Yes | Algorithm 1 REgression Tree for COrrelated Data (RETCO) ... Algorithm 2 A tree-based algorithm ... Algorithm 3 RE-EM Algorithm ... Algorithm 4 Mixed Random Forest (algorithm for a single tree) |
| Open Source Code | Yes | The code for the examples, as well as for the numerical part in Section 5, is written in Python and is available in https://github.com/AssafRab/RETCO. |
| Open Datasets | Yes | The data sets and the prediction problems are described in Section 5.2.1 ... Table 2: Data sets description. FIFA ... Kaggle, Crimes ... UCI, Korea Temperature ... UCI, California Housing ... Kaggle, Parkinson's Disease Telemonitoring ... UCI, Wages ... Brolgar package |
| Dataset Splits | Yes | FIFA ... the training set contains the observations of players from 20 clubs that were randomly sampled (548 observations), and the test set contains the observations of the other clubs (17,939 observations). Communities and Crime ... The training set contains 15 clusters that were randomly sampled (790 observations), where the test set contains the other clusters (1,204 observations). South Korea Temperature ... Measurements of the first two years were selected (575 observations) in order to predict the maximal temperature of the same set of days in the next years (2,325 observations). California Housing Prices ... 100 clusters are randomly sampled (279 observations) for the training set and the other clusters are used as the test set (12,124 observations). Parkinson’s Disease Telemonitoring ... five individuals were randomly sampled (742 observations) for the training set and the others were designated as the test set (5,179 observations). Wages ... 50 individuals were randomly sampled (331 observations) for the training set and the other are used as the test set (6,071 observations). |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used to run its experiments, such as CPU/GPU models or cloud resources. |
| Software Dependencies | No | The code for the examples, as well as for the numerical part in Section 5, is written in Python and is available in https://github.com/AssafRab/RETCO. While Python is mentioned, specific version numbers for Python or any libraries used are not provided. |
| Experiment Setup | Yes | In both algorithms, the stopping rules are depth of tree smaller than four, and number of observations in the terminal node greater than two. ... Additional parameters that are relevant for RF are: the maximal tree depth is 10, the number of regression trees is T = 100, a random half-sample method is used for sampling the training set for each tree (i.e., the training sample size for each tree is 250 without duplicates), three covariates are randomly selected at each split, following the rule of thumb of selecting randomly round(log2(p)) potential covariates at each split. ... In all implementations, the minimum number of observations in a node is 3. For RF implementations, the number of covariates that were sampled at each split is round(log2(p)), the number of trees is 80, and the maximal tree depth is ten. For regression tree implementations, the maximal regression tree depth is five. |
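
Most of the splits quoted in the Dataset Splits row are made at the cluster level: whole clubs, communities, or individuals go to either the training set or the test set, never both. The following is a minimal sketch of such a split, together with the log2(p) rule of thumb quoted in the Experiment Setup row. It is not taken from the authors' repository; the function names `cluster_split` and `covariates_per_split` and the toy cluster labels are hypothetical and used only for illustration.

```python
import math
import random
from collections import defaultdict


def cluster_split(cluster_ids, n_train_clusters, seed=0):
    """Split row indices so that each cluster lies entirely in train or test."""
    # Group row indices by their cluster label.
    rows_by_cluster = defaultdict(list)
    for row, cid in enumerate(cluster_ids):
        rows_by_cluster[cid].append(row)

    # Sample whole clusters (not individual rows) for the training set.
    clusters = sorted(rows_by_cluster)
    rng = random.Random(seed)
    train_clusters = set(rng.sample(clusters, n_train_clusters))

    train_rows = [r for c in clusters if c in train_clusters
                  for r in rows_by_cluster[c]]
    test_rows = [r for c in clusters if c not in train_clusters
                 for r in rows_by_cluster[c]]
    return train_rows, test_rows


def covariates_per_split(p):
    """Rule of thumb: round(log2(p)) candidate covariates per split (at least 1)."""
    return max(1, round(math.log2(p)))


# Toy data (hypothetical): 6 clusters with 4 observations each.
cluster_ids = [c for c in "ABCDEF" for _ in range(4)]
train_rows, test_rows = cluster_split(cluster_ids, n_train_clusters=2, seed=0)
```

Sampling clusters rather than rows keeps correlated observations from the same cluster out of both sets at once, which is the setting the quoted splits describe. The South Korea Temperature split is the exception: it is temporal (first two years versus later years) and would need a date-based rule instead.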