Assumption-lean and data-adaptive post-prediction inference

Authors: Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, Qiongshi Lu

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications.
Researcher Affiliation | Academia | Jiacheng Miao, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA; Xinran Miao, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; Yixuan Wu, University of Wisconsin-Madison, Madison, WI 53726, USA; Jiwei Zhao, Department of Biostatistics and Medical Informatics and Department of Statistics, University of Wisconsin-Madison, Madison, WI 53726, USA; Qiongshi Lu, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
Pseudocode | Yes | Algorithm 1: PSPA estimation with ML-predicted labels. Algorithm 2: PSPA estimation with ML-predicted covariates.
Open Source Code | Yes | The R code implementing PSPA and the benchmark methods, and replicating the simulation and real-data analyses, is available at https://github.com/qlu-lab/pspa.
Open Datasets | Yes | For example, the Genotype-Tissue Expression (GTEx) project is a comprehensive study of gene expression regulation across many human tissues (GTEx Consortium et al., 2015). We regressed DXA-BMD on these variables using data from the UK Biobank (UKB).
Dataset Splits | Yes | The labeled data contains 500 samples, and the unlabeled data contains 500, 1500, 2500, 5000, or 10000 samples, depending on the setting. In the UKB, DXA-BMD measurements are available for only 10% of participants; therefore, the Softimpute algorithm was used to impute DXA-BMD values for the remaining 90% of individuals in the unlabeled dataset.
Hardware Specification | No | The paper does not provide hardware details such as CPU/GPU models or memory used to run the experiments.
Software Dependencies | No | The paper mentions "R codes", a pre-trained random forest, and the Softimpute algorithm, but does not provide version numbers for any software libraries or programming languages.
Experiment Setup | Yes | In all simulations, the ground-truth coefficients are obtained by Monte Carlo approximation with 5 × 10^4 samples. The labeled data contains 500 samples, and the unlabeled data contains 500, 1500, 2500, 5000, or 10000 samples in different settings. A random forest with 100 trees is pre-trained on a hold-out dataset of 1000 samples. All simulations are repeated 1000 times.
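The post-prediction correction idea behind the PSPA algorithms listed above (a classical estimate plus a data-adaptively weighted correction built from ML-predicted labels) can be sketched for the simplest case, mean estimation. This is an illustrative sketch, not the paper's Algorithm 1: the function name `pspa_mean` is hypothetical, and the variance-minimizing weight formula is a standard choice assumed here for demonstration.

```python
import numpy as np

def pspa_mean(y_lab, yhat_lab, yhat_unlab):
    """Estimate E[Y] from labeled (y, yhat) pairs plus unlabeled predictions.

    Sketch of the post-prediction correction idea: the naive labeled-data
    mean is debiased-by-construction, and the prediction-based correction
    is scaled by a data-adaptive weight omega.
    """
    n, N = len(y_lab), len(yhat_unlab)
    # Variance-minimizing weight (assumed form): shrinks toward 0 when the
    # ML predictions are uninformative, so the method never does worse than
    # the classical labeled-data estimator.
    omega = np.cov(y_lab, yhat_lab)[0, 1] / (np.var(yhat_lab, ddof=1) * (1 + n / N))
    # Classical estimate plus a weighted correction from ML predictions.
    return y_lab.mean() + omega * (yhat_unlab.mean() - yhat_lab.mean())

rng = np.random.default_rng(0)
x_lab, x_unlab = rng.normal(size=500), rng.normal(size=5000)
f = lambda x: 2 * x + 0.5                      # stand-in for a pre-trained ML model
y_lab = 2 * x_lab + 0.5 + rng.normal(size=500)  # true mean of Y is 0.5
est = pspa_mean(y_lab, f(x_lab), f(x_unlab))
```

When the predictions are accurate, the weight approaches N/(N + n) and most of the unlabeled sample's precision is captured; when they are pure noise, the covariance term drives the weight toward zero.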
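The Softimpute step mentioned under Dataset Splits fills missing matrix entries with a low-rank fit via iteratively soft-thresholded SVD. The UKB analysis used an existing Softimpute implementation; the minimal sketch below, with an assumed tuning parameter `lam`, only illustrates the idea.

```python
import numpy as np

def soft_impute(X, lam=1.0, n_iters=100):
    """Fill NaNs in X with a low-rank approximation (SoftImpute-style sketch)."""
    mask = np.isnan(X)
    Z = np.where(mask, 0.0, X)  # initialize missing cells with zeros
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)      # soft-threshold the singular values
        low_rank = (U * s) @ Vt           # shrunk low-rank reconstruction
        Z = np.where(mask, low_rank, X)   # keep observed entries fixed
    return Z

X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0])  # rank-1 matrix
X[0, 0] = np.nan                                     # hide one entry
Z = soft_impute(X, lam=0.1, n_iters=200)             # Z[0, 0] is close to 1
```

Because the hidden matrix is exactly rank 1 and the threshold is small, the missing cell is recovered almost exactly; on real data like DXA-BMD, `lam` trades reconstruction fidelity against rank.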
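The simulation pipeline described under Experiment Setup (a 100-tree random forest pre-trained on a 1000-sample hold-out set, then applied to labeled and unlabeled data) can be reconstructed as follows. The linear data-generating model here is an assumption made only for demonstration; the paper's actual simulation designs may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def gen(n):
    """Assumed data-generating process: a simple linear signal plus noise."""
    x = rng.normal(size=(n, 1))
    y = 1.5 * x[:, 0] + rng.normal(size=n)
    return x, y

# Hold-out data used only to pre-train the ML model (1000 samples, 100 trees).
x_hold, y_hold = gen(1000)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_hold, y_hold)

# Labeled data (n = 500) and unlabeled data (here N = 5000, labels discarded),
# matching one of the settings described above.
x_lab, y_lab = gen(500)
x_unlab, _ = gen(5000)
yhat_lab, yhat_unlab = model.predict(x_lab), model.predict(x_unlab)
```

In the full study this pipeline would be wrapped in a loop over the five unlabeled-sample sizes and repeated 1000 times, with ground-truth coefficients approximated once from 5 × 10^4 Monte Carlo samples.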