reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data

Authors: Ruoyu Wang, Miaomiao Su, Qihua Wang

JMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The proposed methods are evaluated through simulation studies and illustrated in a real data analysis. Keywords: Distributed data, Divide and conquer, Kernel method, Missing data, Sieve method
Researcher Affiliation	Academia	Ruoyu Wang, Miaomiao Su, and Qihua Wang are all affiliated with the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China.
Pseudocode	Yes	Algorithm 1: Algorithm for the KDI method. Algorithm 2: Algorithm for the SDI method. Algorithm 3: DWCV algorithm for the KDI method. Algorithm 4: DWCV algorithm for the SDI method.
Open Source Code	Yes	The code to produce the results in the simulation and the real data analysis is available at https://github.com/stat-conifer/Dist Nonpar Imp.
Open Datasets	Yes	GroupLens Research has collected and made available movie rating data sets on the MovieLens website (https://movielens.org). In this section, we apply our method to a large-scale movie rating dataset, the ml-25m dataset.
Dataset Splits	Yes	Split data on the l-th machine into training data of size n/2 with index set I_tr^(l) and test data of size n/2 with index set I_te^(l), for l = 1, ..., L.
Hardware Specification	Yes	All computations are performed in the R Programming (R Core Team, 2016) using a Windows server with a 24-core processor and 128GB RAM.
Software Dependencies	No	The paper mentions 'R Programming (R Core Team, 2016)' but does not specify a precise version number for R or any other software libraries used, which is required for reproducibility.
Experiment Setup	Yes	We fix the total sample size N = 2 × 10^5 and vary the number of machines L = 10, 20, 50, 100, 200, and 500 to evaluate the effect of machine number. A kernel function of order 20 based on Legendre Polynomial (Berlinet, 1993) is used to implement the kernel regression imputation method. The constant c is taken to be 1.3 when d = 5 and 1.7 when d = 15. We take the constant c to be 0.5 when d = 5 and 0.9 when d = 15.
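
Illustrative sketch (not the authors' code, which is written in R): the Dataset Splits and Experiment Setup rows together imply a per-machine sample size and a half/half split on each machine. The Python snippet below shows one way to lay that out. It assumes the local sample size is n = N / L and that observations are assigned to machines by a random permutation; neither assumption is stated in the rows above. The function name split_across_machines and the seed argument are hypothetical.

```python
import numpy as np

N = 200_000  # total sample size, N = 2 x 10^5 (Experiment Setup row)
MACHINE_COUNTS = [10, 20, 50, 100, 200, 500]  # values of L (Experiment Setup row)


def split_across_machines(total_size, num_machines, seed=0):
    """Assign indices 0..total_size-1 to machines, then halve each machine's block.

    Returns one (train_idx, test_idx) pair per machine, each of size n/2.
    The random assignment and the seed are assumptions made for illustration.
    """
    if total_size % num_machines != 0:
        raise ValueError("total_size must be divisible by num_machines")
    n = total_size // num_machines  # assumed local sample size n = N / L
    if n % 2 != 0:
        raise ValueError("local sample size must be even for an n/2 split")
    rng = np.random.default_rng(seed)
    # One row of indices per machine.
    blocks = rng.permutation(total_size).reshape(num_machines, n)
    half = n // 2
    return [(block[:half], block[half:]) for block in blocks]


# Local sample size for each machine count in the setup.
local_sizes = {L: N // L for L in MACHINE_COUNTS}

# Example: train/test index sets for L = 10 machines.
splits = split_across_machines(N, 10)
```

Under these assumptions, local_sizes maps each L to n = N / L, so the training and test sets on each machine shrink from 10,000 observations each at L = 10 to 200 each at L = 500.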