Random Forest Weighted Local Fréchet Regression with Random Objects

Authors: Rui Qiu, Zhou Yu, Ruoqing Zhu

JMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data. Keywords: metric space, Fréchet regression, random forest, nonparametric regression, infinite order U-process
Researcher Affiliation Academia Rui Qiu EMAIL School of Statistics, KLATASDS-MOE East China Normal University Shanghai 200062, China Zhou Yu EMAIL School of Statistics, KLATASDS-MOE East China Normal University Shanghai 200062, China Ruoqing Zhu EMAIL Department of Statistics University of Illinois at Urbana-Champaign Champaign, IL 61820, USA
Pseudocode Yes Algorithm 1: Variable importance calculation
Inputs: A training set Dn = {(Xi, Yi)}_{i=1}^{n}, number of Fréchet trees B.
Step 1. Construct a random forest consisting of B Fréchet trees {Tb(x; Db_n, ξb)}_{b=1}^{B} based on Dn, which generate the random forest kernel for the achievement of RFWLCFR.
Step 2. for i = 1 to n do
Identify the collection Ti of Fréchet trees whose growth (Xi, Yi) did not participate in: Ti = {Tb(x; Db_n, ξb) : 1 ≤ b ≤ B, (Xi, Yi) ∉ Db_n}.
Predict the response of Xi with RFWLCFR, denoted by r̂_oob(Xi), based on the random forest kernel provided by Ti.
end for
Record the mean squared error: R0 = (1/n) Σ_{i=1}^{n} d²(r̂_oob(Xi), Yi).
Step 3. for j = 1 to p do
Permute the values for the jth variable randomly in {Xi}_{i=1}^{n} and repeat Step 2 with the permuted data and the same Ti, 1 ≤ i ≤ n, acquired in Step 2; record the corresponding mean squared error Rj.
end for
Step 4. Calculate the variable importance for the jth variable: VI(X^(j)) = Rj − R0, 1 ≤ j ≤ p.
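The permutation scheme of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's R implementation: `oob_predict` (a callable returning, for each sample, a prediction that uses only trees whose subsample excluded it) and `dist` (the metric d on the response space) are hypothetical placeholders.

```python
import numpy as np

def permutation_importance(X, y, oob_predict, dist, rng=None):
    """Permutation variable importance in the spirit of Algorithm 1.

    oob_predict(X) -> out-of-bag predictions for each row of X;
    dist(a, b)     -> metric d between two responses.
    Both names are illustrative placeholders, not the paper's API.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Baseline out-of-bag mean squared distance R0.
    r0 = np.mean([dist(yhat, yi) ** 2 for yhat, yi in zip(oob_predict(X), y)])
    vi = np.empty(p)
    for j in range(p):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break the j-th feature's link to y
        rj = np.mean([dist(yhat, yi) ** 2 for yhat, yi in zip(oob_predict(Xp), y)])
        vi[j] = rj - r0  # VI(X^(j)) = R_j - R_0
    return vi
```

A feature that the out-of-bag predictions never use yields VI near zero, while permuting an influential feature inflates the error and yields a large VI.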
Open Source Code No The paper does not explicitly state that source code for the methodology described in this paper is openly available, nor does it provide a direct link to a code repository. It mentions that "Julia code for the implementation of IFR can be found in the GitHub platform" and "Our RFWLCFR and RFWLLFR are also implemented in R," but these refer to a third-party tool and a general statement of implementation, without providing specific access to *their* code.
Open Datasets Yes The New York City Taxi and Limousine Commission provides detailed records on yellow taxi rides, including pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, payment types, and other information. The data can be downloaded from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. We also gather weather data for January and February 2019 from https://www.wunderground.com/history/daily/us/ny/new-york-city/KLGA/date. The data are collected from United Nations Databases (http://data.un.org/) and UN World Population Prospects 2019 Databases (https://population.un.org/wpp/Download).
Dataset Splits Yes The data set consisting of 1416 samples is partitioned randomly into three parts for Fréchet regression: a training set of size 850, a validation set of size 283, and a testing set of size 283, following a ratio of 6 : 2 : 2. We then perform 9-fold testing to evaluate the performance of all Fréchet regression methods. Specifically, we divide the 162 countries into 9 parts evenly and conduct 9 training runs. For each run, one of the 9 parts is chosen as the testing set and the rest as the training set.
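The two evaluation protocols described above (a random 6 : 2 : 2 train/validation/test partition, and 9 roughly even folds each serving once as the test set) can be sketched as below; the helper names are illustrative, not from the paper.

```python
import numpy as np

def split_6_2_2(n, seed=0):
    """Random 6:2:2 train/validation/test partition of n sample indices,
    mirroring the paper's 1416 = 850 + 283 + 283 split (illustrative helper)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = round(0.6 * n)
    n_val = round(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def nine_fold(n_units, seed=0):
    """Shuffle n_units items (e.g. 162 countries) into 9 roughly even folds;
    each fold is used once as the testing set, the rest as training."""
    idx = np.random.default_rng(seed).permutation(n_units)
    return np.array_split(idx, 9)
```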
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It focuses on the methods and datasets.
Software Dependencies No The paper mentions R-package frechet (Chen et al., 2020), R-package FrechForest (Capitaine, 2021), Julia code for IFR, and R-package matrix-manifold (Lin, 2020), but it does not specify version numbers for these software dependencies, only the year of their publication or creation.
Experiment Setup Yes There are three hyperparameters for each Fréchet tree: the size sn of each subsample, the depth of Fréchet trees, and the number of features randomly selected at each internal node. The choice of sn is very tedious and time-consuming. Here we instead acquire all subsamples by sampling from the training data set Dn with replacement, which is commonly used in random forest codes. When the size n of Dn is large enough, each subsample is expected to contain a fraction (1 − 1/e) ≈ 63.2% of the unique examples of Dn. We consider 3 to log2(n) for the range of tuning the depth of Fréchet trees, where n is the number of training samples. For a fair comparison, each method chooses the hyperparameters by cross-validation.
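The 63.2% unique-sample fraction quoted above follows from sampling with replacement: a bootstrap sample of size n contains on average a fraction 1 − (1 − 1/n)^n of the distinct training points, which tends to 1 − 1/e as n grows. A quick numerical check (illustrative only):

```python
import math
import numpy as np

def unique_fraction(n, seed=0):
    """Fraction of distinct indices in one bootstrap sample of size n
    drawn with replacement from {0, ..., n-1}."""
    sample = np.random.default_rng(seed).integers(0, n, size=n)
    return len(np.unique(sample)) / n

# Limit of 1 - (1 - 1/n)^n as n grows.
limit = 1 - 1 / math.e  # about 0.632
```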