Optimal subsampling for high-dimensional partially linear models via machine learning methods
Authors: Yujing Shao, Lei Wang, Heng Lian, Haiying Wang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation studies and an empirical analysis of the Physicochemical Properties of Protein Tertiary Structure dataset demonstrate the superior performance of our subsample estimators. |
| Researcher Affiliation | Academia | Yujing Shao (EMAIL), Lei Wang (EMAIL), School of Statistics and Data Science, KLMDASR, LEBPS and LPMC, Nankai University, China; Heng Lian (EMAIL), Department of Mathematics, City University of Hong Kong, China; Haiying Wang (EMAIL), Department of Statistics, University of Connecticut, U.S.A. |
| Pseudocode | Yes | Algorithm 1 Two-step Neyman-orthogonal score subsampling algorithm for PLMs |
| Open Source Code | No | The paper references the source code of third-party tools the authors used (e.g., the R packages glmnet, gbm, and randomForest), but does not provide the authors' own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | Physicochemical Properties of Protein Tertiary Structure (PTS) dataset, available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/265/physicochemical+properties+of+protein+tertiary+structure. |
| Dataset Splits | Yes | the full-data DML estimator θ̂_F with two-fold random partition; D_{p,1} and D_{p,2} are two non-overlapping chunks of equal size r0/2 from D_p; the penalized ML estimators m̂_{p,k} and l̂_{p,k} are acquired using D_p \ D_{p,k} according to (6)–(7) for k = 1, 2, respectively. |
| Hardware Specification | No | The paper discusses computational time but does not provide any specific hardware details such as GPU/CPU models or processor types. |
| Software Dependencies | Yes | In this paper, we consider three ML methods: Lasso, gradient boosted machines (Gbm), and random forest (Rf), which are implemented in the R packages glmnet (Friedman et al., 2010), gbm (Greenwell et al., 2022), and randomForest (Liaw and Wiener, 2002), respectively. |
| Experiment Setup | Yes | The full data size is set to n = 10^6, with the true parameter θ0 = (1, 1, 1, 1)^T, p = 4, and q = 200 or 600. We set r0 = 600 and r = 600, 800, 1000, 1200. We consider the following three forms of g0(·): ... where γ0 = (γ01, . . . , γ0s, 0, . . . , 0)^T ∈ R^q with γ0j = 0.4(1 + j/2s) and s = 10. All the simulation results are based on 500 replications. |
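The two-fold random partition quoted in the Dataset Splits row (a pilot subsample of size r0 split into two non-overlapping halves for cross-fitting) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the variable names `pilot`, `fold1`, and `fold2` are assumptions, and the elided forms of g0(·) are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes quoted in the report
n = 10**6   # full data size
r0 = 600    # pilot subsample size

# Draw a pilot subsample D_p of size r0 from the full-data indices,
# then shuffle and split it into two non-overlapping folds of equal
# size r0/2, as used for the cross-fitted DML estimator.
pilot = rng.choice(n, size=r0, replace=False)
perm = rng.permutation(r0)
fold1 = pilot[perm[: r0 // 2]]
fold2 = pilot[perm[r0 // 2:]]

# Each fold's nuisance estimators would be fit on D_p \ D_{p,k},
# i.e. on the other fold, for k = 1, 2.
```

The key property the sketch enforces is that the two folds are disjoint and of equal size, so each nuisance estimator is fit on data independent of the scores it is evaluated on.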