PRIMO: Private Regression in Multiple Outcomes

Authors: Seth Neel

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, on the task of genomic risk prediction with multiple phenotypes we find that even for values of l far smaller than the theory would predict, our projection-based method improves the accuracy relative to the variant that doesn't use the projection. (...) In Section 6, we implement our ReuseCovGauss and ReuseCovProj algorithms using SNP data from two of the most common genomic databases Mailman et al. (2007); Fairley et al. (2019). We evaluate the accuracy of our algorithms ReuseCovGauss and ReuseCovProj X on the task of genomic risk prediction using real X data from two of the largest publicly available databases, and simulated outcomes Y so that we can easily vary the number of outcomes l.
Researcher Affiliation | Academia | Seth Neel, EMAIL, Harvard Business School
Pseudocode | Yes | Algorithm 1 Input: n, λ, X ∈ 𝒳^n ⊂ (ℝ^d)^n, Y = [y_1, …, y_l] ∈ 𝒴^{l×n}, privacy params: ϵ, δ. ReuseCov; Algorithm 2 Input: X ∈ 𝒳^n ⊂ (ℝ^d)^n, Y = [y_1, …, y_l] ∈ 𝒴^{l×n}, privacy params: ϵ, δ. Gauss Proj Y; Algorithm 3 Input: λ, X ∈ 𝒳^n ⊂ (ℝ^d)^n, Y = [y_1, …, y_l] ∈ 𝒴^{l×n}, privacy params: ϵ, δ. We denote by B the Algorithm in Lemma 2.2 Hager (2001). λ ReuseCov; Algorithm 4 Input: s, λ, X ∈ 𝒳^n ⊂ (ℝ^d)^n, Y = [y_1, …, y_l] ∈ 𝒴^{l×n}, privacy params: ϵ, δ. λ SubSampReuseCov
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using a third-party library, Opacus, but not their own implementation code.
Open Datasets | Yes | The genomic datasets are from two sources: the 1000 Genomes project (1KG) Fairley et al. (2019), and the Database of Genomes and Phenotypes Mailman et al. (2007) (accession phs000688.v1.p1). (...) In addition, in Appendix 7.8 we include experiments on two additional datasets over smaller values of l, one constructed by sub-sampling MNIST Deng (2012) and generating synthetic outcomes from a noisy linear model...
Dataset Splits | No | The paper describes generating synthetic data and running experiments multiple times, but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or references to standard splits).
Hardware Specification | No | The paper mentions that it does not include experiments with larger values of d for computational reasons, but it does not specify any exact GPU/CPU models, processor types, or memory details.
Software Dependencies | No | The paper mentions using 'the Opacus Yousefpour et al. (2021) library from Meta' and that it is a 'PyTorch' library, but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | For a given dataset and setting of (n, d, l) we run ReuseCovProj X and ReuseCovGauss 10 different times, and calculate the resulting average MSE. (...) In Figures (a)-(d) we plot the average R^2 for d = 25, l = (1, 11, 101, 201, 401, 601, 801, 1001), fixing (ϵ, δ) = (5, 1/n^2) with n = 5008, 6042. (...) After centering our haplotype matrix X by subtracting off the row means we generate synthetic phenotypes y_i for i = 1 . . . l by generating a random θ_i ∼ N(0, I_{d×d}), y_ij = θ_i · x_j + N(0, 1).
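
The synthetic-phenotype step quoted in the Experiment Setup row can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's code (none is released): the random binary matrix is a stand-in for the real haplotype data, X is assumed to be stored as n x d with one individual per row, "row means" is taken literally under that layout, and the names `make_synthetic_phenotypes` and `r_squared` are hypothetical. The private algorithms themselves are not implemented here.

```python
import numpy as np

def make_synthetic_phenotypes(X, l, rng):
    """Center X and draw l synthetic outcomes from a noisy linear model.

    X is assumed to have shape (n, d). Returns the centered matrix,
    the (d, l) matrix of true coefficients, and the (n, l) outcomes.
    """
    # Subtract each row's mean, following the quoted description literally.
    Xc = X - X.mean(axis=1, keepdims=True)
    n, d = Xc.shape
    # Column i of Theta is theta_i ~ N(0, I_d).
    Theta = rng.standard_normal((d, l))
    # y_ij = theta_i . x_j + N(0, 1) for every individual j and outcome i.
    Y = Xc @ Theta + rng.standard_normal((n, l))
    return Xc, Theta, Y

def r_squared(Y_true, Y_pred):
    """Average coefficient of determination across the l outcomes."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

rng = np.random.default_rng(0)
n, d, l = 5008, 25, 11  # one (n, d, l) setting from the quoted setup

# Stand-in for the haplotype matrix (assumption: binary 0/1 entries).
X = rng.integers(0, 2, size=(n, d)).astype(float)
Xc, Theta, Y = make_synthetic_phenotypes(X, l, rng)

# Privacy parameters as quoted: (epsilon, delta) = (5, 1/n^2).
epsilon, delta = 5.0, 1.0 / n**2
```

To mirror the quoted protocol, one would run a private estimator on `(Xc, Y)` 10 times per `(n, d, l)` setting and average `r_squared` (or the MSE) over the runs.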