Optimal subsampling for high-dimensional partially linear models via machine learning methods

Authors: Yujing Shao, Lei Wang, Heng Lian, Haiying Wang

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Simulation studies and an empirical analysis of the Physicochemical Properties of Protein Tertiary Structure dataset demonstrate the superior performance of our subsample estimators."
Researcher Affiliation | Academia | Yujing Shao and Lei Wang, School of Statistics and Data Science, KLMDASR, LEBPS and LPMC, Nankai University, China; Heng Lian, Department of Mathematics, City University of Hong Kong, China; Haiying Wang, Department of Statistics, University of Connecticut, U.S.A.
Pseudocode | Yes | Algorithm 1: "Two-step Neyman-orthogonal score subsampling algorithm for PLMs."
Open Source Code | No | The paper cites the third-party R packages it relies on (glmnet, gbm, randomForest) but does not provide the authors' own implementation of the proposed methodology.
Open Datasets | Yes | Physicochemical Properties of Protein Tertiary Structure (PTS) dataset, available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/265/physicochemical+properties+of+protein+tertiary+structure.
Dataset Splits | Yes | The full-data DML estimator θ̂_F uses a two-fold random partition: D_{p,1} and D_{p,2} are two non-overlapping chunks of equal size r_0/2 from D_p, and the penalized ML estimators m̂_{p,k} and l̂_{p,k} are obtained from D_p \ D_{p,k} according to (6)-(7) for k = 1, 2, respectively.
Hardware Specification | No | The paper discusses computational time but does not report hardware details such as CPU/GPU models or processor types.
Software Dependencies | Yes | "In this paper, we consider three ML methods: Lasso, gradient boosted machines (Gbm), and random forest (Rf), which are implemented in the R packages glmnet (Friedman et al., 2010), gbm (Greenwell et al., 2022), and randomForest (Liaw and Wiener, 2002), respectively."
Experiment Setup | Yes | The full data size is n = 10^6, with true parameter θ_0 = (1, 1, 1, 1)^T, p = 4, and q = 200 or 600; r_0 = 600 and r = 600, 800, 1000, 1200. Three forms of g_0(·) are considered, with γ_0 = (γ_{01}, ..., γ_{0s}, 0, ..., 0)^T ∈ R^q, γ_{0j} = 0.4(1 + j/2s), and s = 10. All simulation results are based on 500 replications.
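The "two-step" scheme named in the Pseudocode row, a uniform pilot subsample followed by a weighted subsample, follows a common pattern in the optimal-subsampling literature. The sketch below illustrates that generic pattern only; it is not the paper's Algorithm 1, and `score_norms_fn` is a hypothetical callback standing in for the pilot-based computation of per-observation score norms.

```python
import numpy as np

def two_step_subsample(score_norms_fn, n, r0, r, seed=None):
    """Generic two-step subsampling sketch (NOT the paper's exact Algorithm 1).

    Step 1: draw a uniform pilot subsample of size r0; a pilot estimate built
    from it yields nonnegative per-observation score norms for all n points.
    Step 2: draw r indices with probabilities proportional to those norms.
    """
    rng = np.random.default_rng(seed)
    pilot_idx = rng.choice(n, size=r0, replace=False)       # step 1: uniform pilot
    norms = np.asarray(score_norms_fn(pilot_idx), dtype=float)
    probs = norms / norms.sum()                             # approx. optimal probabilities
    sub_idx = rng.choice(n, size=r, replace=True, p=probs)  # step 2: weighted draw
    return sub_idx, probs[sub_idx]                          # indices and their probabilities
```

Downstream estimation on the subsample would typically weight each sampled point by the inverse of its sampling probability to keep the estimator unbiased.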
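The two-fold cross-fitting split quoted in the Dataset Splits row can be sketched as follows. This is an illustration of the generic DML splitting pattern, not the authors' implementation; `fit_nuisance` is a hypothetical stand-in for a penalized ML fitter such as Lasso.

```python
import numpy as np

def two_fold_crossfit(X, y, fit_nuisance, seed=None):
    """Two-fold cross-fitting sketch: randomly split the sample into two
    non-overlapping halves D_1 and D_2 of equal size, fit the nuisance
    function on D \\ D_k, and predict on the held-out fold D_k, so that
    every prediction is out-of-fold."""
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    folds = [perm[: n // 2], perm[n // 2:]]   # the two halves D_1, D_2
    preds = np.empty(n)
    for k in (0, 1):
        hold, train = folds[k], folds[1 - k]  # evaluate on D_k, fit on the rest
        predict = fit_nuisance(X[train], y[train])
        preds[hold] = predict(X[hold])
    return preds
```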
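The true parameters of the simulation design quoted in the Experiment Setup row can be written out directly. One assumption: the coefficient formula 0.4(1 + j/2s) is read here as 0.4(1 + j/(2s)); the quoted text is ambiguous about the grouping.

```python
import numpy as np

def make_true_parameters(q=200, s=10):
    """Construct the true parameters of the quoted simulation design:
    theta_0 = (1, 1, 1, 1)^T (p = 4) and a sparse gamma_0 in R^q whose
    first s entries follow gamma_{0j} = 0.4 * (1 + j / (2 * s))
    (assumed reading of the quoted formula)."""
    theta0 = np.ones(4)
    gamma0 = np.zeros(q)
    j = np.arange(1, s + 1)          # j = 1, ..., s
    gamma0[:s] = 0.4 * (1 + j / (2 * s))
    return theta0, gamma0
```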