Assumption-lean and data-adaptive post-prediction inference
Authors: Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, Qiongshi Lu
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications. |
| Researcher Affiliation | Academia | Jiacheng Miao (EMAIL), Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA; Xinran Miao (EMAIL), Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; Yixuan Wu (EMAIL), University of Wisconsin-Madison, Madison, WI 53726, USA; Jiwei Zhao (EMAIL), Department of Biostatistics and Medical Informatics, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53726, USA; Qiongshi Lu (EMAIL), Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA |
| Pseudocode | Yes | Algorithm 1: PSPA estimation with ML-predicted labels; Algorithm 2: PSPA estimation with ML-predicted covariates |
| Open Source Code | Yes | The R code to implement PSPA and the benchmark methods, and to replicate the simulation and real-data analyses, is available at https://github.com/qlu-lab/pspa. |
| Open Datasets | Yes | For example, the Genotype-Tissue Expression (GTEx) project is a comprehensive study focusing on gene expression regulation in many human tissues (GTEx Consortium et al., 2015). We regressed DXA-BMD on these variables using data from the UK Biobank (UKB). |
| Dataset Splits | Yes | The labeled data contains 500 samples, and the unlabeled data contains 500, 1500, 2500, 5000, or 10000 samples depending on the setting. In the UKB, DXA-BMD measurements are available for only 10% of the participants. Therefore, the Softimpute algorithm was used to impute DXA-BMD values for the remaining 90% of individuals in the unlabeled dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or other hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using "R codes", a "pre-trained random forest", and the "Softimpute algorithm", but does not provide specific version numbers for any software libraries or programming languages. |
| Experiment Setup | Yes | In all simulations, the ground-truth coefficients are obtained by a Monte Carlo approximation with 5 × 10^4 samples. The labeled data contains 500 samples, and the unlabeled data contains 500, 1500, 2500, 5000, or 10000 samples depending on the setting. A pre-trained random forest with 100 trees is obtained from a hold-out dataset with 1000 samples. All simulations are repeated 1000 times. |
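The experiment-setup row above can be sketched as code. The paper's implementation is in R and its data-generating model is not given in this summary, so the following is a minimal Python/numpy illustration with a hypothetical linear model (`Y = 1 + 2X + noise`): a Monte Carlo ground truth from 5 × 10^4 samples, 500 labeled samples, and unlabeled sets of the varying sizes listed in the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating model (not specified in the summary above):
# Y = 1 + 2*X + noise; the target parameter is the slope of Y on X.
def simulate(n):
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    return x, y

def ols_slope(x, y):
    # Slope from an intercept-plus-slope least-squares fit.
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Ground truth via Monte Carlo approximation with 5 x 10^4 samples.
x_mc, y_mc = simulate(50_000)
truth = ols_slope(x_mc, y_mc)

# Splits used in the simulations: 500 labeled samples and unlabeled
# sets of varying size (labels discarded to mimic missingness).
n_labeled = 500
for n_unlabeled in [500, 1500, 2500, 5000, 10_000]:
    x_lab, y_lab = simulate(n_labeled)
    x_unlab, _ = simulate(n_unlabeled)   # labels treated as unobserved
    est = ols_slope(x_lab, y_lab)        # labeled-data-only baseline
```

In the paper's actual pipeline, a pre-trained random forest (100 trees, fit on a 1000-sample hold-out set) would supply predicted labels for the unlabeled data, and the whole loop would be repeated 1000 times; those pieces are omitted here to keep the sketch self-contained.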