Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data

Authors: Ruoyu Wang, Miaomiao Su, Qihua Wang

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed methods are evaluated through simulation studies and illustrated in a real data analysis. Keywords: Distributed data, Divide and conquer, Kernel method, Missing data, Sieve method
Researcher Affiliation | Academia | Ruoyu Wang (EMAIL), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China; Miaomiao Su (EMAIL), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China; Qihua Wang (EMAIL), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
Pseudocode | Yes | Algorithm 1: Algorithm for the KDI method; Algorithm 2: Algorithm for the SDI method; Algorithm 3: DWCV algorithm for the KDI method; Algorithm 4: DWCV algorithm for the SDI method
Open Source Code | Yes | The code to produce the results in the simulation and the real data analysis is available at https://github.com/stat-conifer/DistNonparImp.
Open Datasets | Yes | GroupLens Research has collected and made available movie rating data sets on the MovieLens website (https://movielens.org). In this section, we apply our method to a large-scale movie rating dataset, the ml-25m dataset.
Dataset Splits | Yes | Split data on the l-th machine into training data of size n/2 with index set I_tr^(l) and test data of size n/2 with index set I_te^(l) for l = 1, ..., L;
Hardware Specification | Yes | All computations are performed in the R Programming (R Core Team, 2016) using a Windows server with a 24-core processor and 128GB RAM.
Software Dependencies | No | The paper mentions 'R Programming (R Core Team, 2016)' but does not specify a precise version number for R or for any other software library used, which is required for reproducibility.
Experiment Setup | Yes | We fix the total sample size N = 2 × 10^5 and vary the number of machines L = 10, 20, 50, 100, 200, and 500 to evaluate the effect of machine number. A kernel function of order 20 based on Legendre Polynomial (Berlinet, 1993) is used to implement the kernel regression imputation method. The constant c is taken to be 1.3 when d = 5 and 1.7 when d = 15. We take the constant c to be 0.5 when d = 5 and 0.9 when d = 15.
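
The dataset split and the machine counts quoted in the table can be illustrated with a short sketch. This is not the authors' code (their implementation is in R); it is a minimal Python illustration under stated assumptions: every machine holds the same number of observations n = N/L, the training/test split is random (the quoted text fixes only the sizes, n/2 each), and the names `split_indices`, `local_splits`, and the seed values are hypothetical.

```python
import random

N = 200_000  # total sample size N = 2 x 10^5 from the experiment setup
MACHINE_COUNTS = [10, 20, 50, 100, 200, 500]  # values of L from the setup


def split_indices(n, seed):
    """Split local indices 0..n-1 into a training half and a test half.

    Assumption: the split is random; the source only states the sizes (n/2 each).
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return sorted(idx[:half]), sorted(idx[half:])


def local_splits(total, num_machines, seed=0):
    """Return one (train, test) index pair per machine.

    Assumption: all machines hold the same local sample size n = total / L.
    """
    n = total // num_machines
    return [split_indices(n, seed + l) for l in range(num_machines)]


# Local sample size n on each machine for every L in the setup.
local_sizes = {L: N // L for L in MACHINE_COUNTS}

# Index sets for the largest machine count, L = 500.
splits = local_splits(N, 500)
```

With L = 500 each machine holds n = 400 observations, so each local training and test set has 200 indices; with L = 10 the local sets have 10,000 indices each.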