Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data
Authors: Ruoyu Wang, Miaomiao Su, Qihua Wang
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed methods are evaluated through simulation studies and illustrated in a real data analysis. Keywords: Distributed data, Divide and conquer, Kernel method, Missing data, Sieve method |
| Researcher Affiliation | Academia | Ruoyu Wang, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; Miaomiao Su, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; Qihua Wang, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China |
| Pseudocode | Yes | Algorithm 1: Algorithm for the KDI method; Algorithm 2: Algorithm for the SDI method; Algorithm 3: DWCV algorithm for the KDI method; Algorithm 4: DWCV algorithm for the SDI method |
| Open Source Code | Yes | The code to produce the results in the simulation and the real data analysis is available at https://github.com/stat-conifer/DistNonparImp. |
| Open Datasets | Yes | GroupLens Research has collected and made available movie rating data sets on the MovieLens website (https://movielens.org). In this section, we apply our method to a large-scale movie rating dataset, the ml-25m dataset. |
| Dataset Splits | Yes | Split data on the $l$-th machine into training data of size $n/2$ with index set $I_{\mathrm{tr}}^{(l)}$ and test data of size $n/2$ with index set $I_{\mathrm{te}}^{(l)}$ for $l = 1, \dots, L$. |
| Hardware Specification | Yes | All computations are performed in the R Programming (R Core Team, 2016) using a Windows server with a 24-core processor and 128GB RAM. |
| Software Dependencies | No | The paper mentions 'R Programming (R Core Team, 2016)' but does not specify a precise version number for R or any other software libraries used, which is required for reproducibility. |
| Experiment Setup | Yes | We fix the total sample size N = 2 × 10^5 and vary the number of machines L = 10, 20, 50, 100, 200, and 500 to evaluate the effect of the number of machines. A kernel function of order 20 based on Legendre polynomials (Berlinet, 1993) is used to implement the kernel regression imputation method. The constant c is taken to be 1.3 when d = 5 and 1.7 when d = 15. We take the constant c to be 0.5 when d = 5 and 0.9 when d = 15. |
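
The split quoted under Dataset Splits and the machine grid quoted under Experiment Setup can be illustrated with a short sketch. It is written in Python for illustration only; the paper's own code is in R. The function name `split_machines`, the random assignment of observations to machines, and the equal local sample size n = N / L are assumptions made for this example and are not taken from the paper, which only states that each machine's data is divided into training and test sets of size n/2.

```python
import random


def split_machines(N, L, seed=0):
    """Hypothetical helper: partition indices 0..N-1 evenly across L machines,
    then halve each machine's block into training and test index lists.

    Assumes N is divisible by L and that the local size n = N / L is even.
    """
    if N % L != 0:
        raise ValueError("N must be divisible by L")
    n = N // L
    if n % 2 != 0:
        raise ValueError("local sample size n must be even")

    # Assumption: observations are assigned to machines at random.
    indices = list(range(N))
    random.Random(seed).shuffle(indices)

    splits = []
    for l in range(L):
        block = indices[l * n:(l + 1) * n]   # data held by machine l
        train = block[:n // 2]               # training index set, size n/2
        test = block[n // 2:]                # test index set, size n/2
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    N = 2 * 10**5  # total sample size quoted in the table
    for L in (10, 20, 50, 100, 200, 500):
        splits = split_machines(N, L)
        train, test = splits[0]
        print(f"L={L}: n={N // L}, train={len(train)}, test={len(test)}")
```

Each of the six values of L divides N = 2 × 10^5 into an even local sample size, so the halving step needs no rounding in this setting.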