Optimal subsampling for high-dimensional partially linear models via machine learning methods

Authors: Yujing Shao, Lei Wang, Heng Lian, Haiying Wang

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Simulation studies and an empirical analysis of the Physicochemical Properties of Protein Tertiary Structure dataset demonstrate the superior performance of our subsample estimators."
Researcher Affiliation | Academia | Yujing Shao and Lei Wang, School of Statistics and Data Science, KLMDASR, LEBPS and LPMC, Nankai University, China; Heng Lian, Department of Mathematics, City University of Hong Kong, China; Haiying Wang, Department of Statistics, University of Connecticut, U.S.A.
Pseudocode | Yes | Algorithm 1: "Two-step Neyman-orthogonal score subsampling algorithm for PLMs."
Open Source Code | No | The paper cites the third-party R packages it relies on (glmnet, gbm, randomForest) but does not provide the authors' own implementation of the proposed methodology.
Open Datasets | Yes | Physicochemical Properties of Protein Tertiary Structure (PTS) dataset, available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/265/physicochemical+properties+of+protein+tertiary+structure.
Dataset Splits | Yes | The full-data DML estimator θ̂_F uses a two-fold random partition: D_{p,1} and D_{p,2} are two non-overlapping chunks of equal size r_0/2 from D_p, and the penalized ML estimators m̂_{p,k} and l̂_{p,k} are obtained from D_p \ D_{p,k} according to (6)-(7) for k = 1, 2, respectively.
Hardware Specification | No | The paper discusses computational time but does not report hardware details such as CPU/GPU models or processor types.
Software Dependencies | Yes | "In this paper, we consider three ML methods: Lasso, gradient boosted machines (Gbm), and random forest (Rf), which are implemented in the R packages glmnet (Friedman et al., 2010), gbm (Greenwell et al., 2022), and randomForest (Liaw and Wiener, 2002), respectively."
Experiment Setup | Yes | The full data size is n = 10^6, with true parameter θ_0 = (1, 1, 1, 1)^T, p = 4, and q = 200 or 600; r_0 = 600 and r = 600, 800, 1000, 1200. Three forms of g_0(·) are considered, with γ_0 = (γ_{01}, ..., γ_{0s}, 0, ..., 0)^T ∈ R^q, γ_{0j} = 0.4(1 + j/2s), and s = 10. All simulation results are based on 500 replications.
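The "two-step" scheme named in the Pseudocode row, a uniform pilot subsample followed by a weighted subsample, follows a common pattern in the optimal-subsampling literature. The sketch below illustrates that generic pattern only; it is not the paper's Algorithm 1, and `score_norms_fn` is a hypothetical callback standing in for the pilot-based computation of per-observation score norms.

```python
import numpy as np

def two_step_subsample(score_norms_fn, n, r0, r, seed=None):
    """Generic two-step subsampling sketch (NOT the paper's exact Algorithm 1).

    Step 1: draw a uniform pilot subsample of size r0; a pilot estimate built
    from it yields nonnegative per-observation score norms for all n points.
    Step 2: draw r indices with probabilities proportional to those norms.
    """
    rng = np.random.default_rng(seed)
    pilot_idx = rng.choice(n, size=r0, replace=False)       # step 1: uniform pilot
    norms = np.asarray(score_norms_fn(pilot_idx), dtype=float)
    probs = norms / norms.sum()                             # approx. optimal probabilities
    sub_idx = rng.choice(n, size=r, replace=True, p=probs)  # step 2: weighted draw
    return sub_idx, probs[sub_idx]                          # indices and their probabilities
```

Downstream estimation on the subsample would typically weight each sampled point by the inverse of its sampling probability to keep the estimator unbiased.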
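The two-fold cross-fitting split quoted in the Dataset Splits row can be sketched as follows. This is an illustration of the generic DML splitting pattern, not the authors' implementation; `fit_nuisance` is a hypothetical stand-in for a penalized ML fitter such as Lasso.

```python
import numpy as np

def two_fold_crossfit(X, y, fit_nuisance, seed=None):
    """Two-fold cross-fitting sketch: randomly split the sample into two
    non-overlapping halves D_1 and D_2 of equal size, fit the nuisance
    function on D \\ D_k, and predict on the held-out fold D_k, so that
    every prediction is out-of-fold."""
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    folds = [perm[: n // 2], perm[n // 2:]]   # the two halves D_1, D_2
    preds = np.empty(n)
    for k in (0, 1):
        hold, train = folds[k], folds[1 - k]  # evaluate on D_k, fit on the rest
        predict = fit_nuisance(X[train], y[train])
        preds[hold] = predict(X[hold])
    return preds
```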
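The true parameters of the simulation design quoted in the Experiment Setup row can be written out directly. One assumption: the coefficient formula 0.4(1 + j/2s) is read here as 0.4(1 + j/(2s)); the quoted text is ambiguous about the grouping.

```python
import numpy as np

def make_true_parameters(q=200, s=10):
    """Construct the true parameters of the quoted simulation design:
    theta_0 = (1, 1, 1, 1)^T (p = 4) and a sparse gamma_0 in R^q whose
    first s entries follow gamma_{0j} = 0.4 * (1 + j / (2 * s))
    (assumed reading of the quoted formula)."""
    theta0 = np.ones(4)
    gamma0 = np.zeros(q)
    j = np.arange(1, s + 1)          # j = 1, ..., s
    gamma0[:s] = 0.4 * (1 + j / (2 * s))
    return theta0, gamma0
```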