More Efficient Estimation for Logistic Regression with Optimal Subsamples

Authors: HaiYing Wang

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of the more efficient estimators in terms of both estimation efficiency and computational efficiency in this section. For simulation, to compare with the original OSMAC estimator, we use exactly the same setup used in Section 5.1 of Wang et al. (2018). Specifically, the full data sample size N = 10,000 and the true value of β, β_t, is a 7 × 1 vector of 0.5. The following 6 distributions of x are considered: multivariate normal distribution with mean zero (mzNormal), multivariate normal distribution with nonzero mean (nzNormal), multivariate normal distribution with mean zero and unequal variances (ueNormal), mixture of two multivariate normal distributions with different means (mixNormal), multivariate t distribution with 3 degrees of freedom (T3), and exponential distribution (EXP). [...] Figure 1 presents the relative efficiency of β̂_uw and β̂_p based on two different choices of π_i^OS: π_i^Aopt and π_i^Lopt. It is seen that in general β̂_uw and β̂_p are more efficient than β̂_w. [...] We also calculate the empirical unconditional MSE by generating the full data in each repetition of the simulation. The results are similar and thus are omitted. To evaluate the performance of the proposed method with different choices of the subsampling probabilities for subsampling with replacement and Poisson subsampling, Figure 2 plots empirical MSEs of using π^Aopt, π^Lopt, π^lcc (local case-control), and the uniform subsampling probability.
Researcher Affiliation | Academia | HaiYing Wang, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
Pseudocode | Yes | Algorithm 1: More efficient estimation based on subsampling with replacement [...] Algorithm 2: More efficient estimation based on Poisson subsampling
Open Source Code | No | The paper does not contain any statement about releasing the code for the methodology described, nor any specific links to a code repository.
Open Datasets | Yes | We also apply the more efficient estimation methods to a supersymmetric (SUSY) benchmark data set (Baldi et al., 2014) available from the Machine Learning Repository (Dua and Karra Taniskidou, 2017).
Dataset Splits | Yes | We fix the first-step sample size n0 = 200 and choose n to be 100, 200, 400, 600, 800, and 1000. This is the same setup used in Wang et al. (2018). [...] We use the more efficient estimation methods with subsample size n to estimate parameters in logistic regression. Figure 4 gives the relative efficiency of β̂_uw and β̂_p to β̂_w for both π_i^Lopt and π_i^Aopt.
Hardware Specification | Yes | All methods are implemented in the R programming language (R Core Team, 2017), and computations are carried out on a desktop running Ubuntu Linux 16.04 with an Intel i7 processor and 16GB RAM. Only one logical CPU is used for the calculation. [...] We also use a smaller computer with 8GB RAM to implement the method.
Software Dependencies | Yes | All methods are implemented in the R programming language (R Core Team, 2017)
Experiment Setup | Yes | The full data sample size N = 10,000 and the true value of β, β_t, is a 7 × 1 vector of 0.5. [...] We fix the first-step sample size n0 = 200 and choose n to be 100, 200, 400, 600, 800, and 1000. This is the same setup used in Wang et al. (2018). [...] We set the value of d to d = 50, the values of N to be N = 10^4, 10^5, 10^6, and 10^7, and the subsample sizes to be n0 = 200 and n = 1000.
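The simulation setup quoted under Research Type above can be sketched in code. N, d, and the true β come from the paper; the covariance matrix, the nonzero mean, the unequal-variance scaling, the mixture components, and the exponential rate below are illustrative assumptions, not values taken from the text, and Python stands in for the paper's R implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 7
beta_t = np.full(d, 0.5)  # true beta: a 7 x 1 vector of 0.5

# Assumed AR(1)-type covariance for illustration only.
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
scale = np.sqrt(np.outer(np.arange(1, d + 1), np.arange(1, d + 1)))

generators = {
    "mzNormal": lambda: rng.multivariate_normal(np.zeros(d), Sigma, N),
    "nzNormal": lambda: rng.multivariate_normal(np.full(d, 1.5), Sigma, N),
    "ueNormal": lambda: rng.multivariate_normal(np.zeros(d), Sigma * scale, N),
    "mixNormal": lambda: np.where(
        rng.random((N, 1)) < 0.5,
        rng.multivariate_normal(np.ones(d), Sigma, N),
        rng.multivariate_normal(-np.ones(d), Sigma, N),
    ),
    # Multivariate t with 3 df: normal draws divided by sqrt(chi2_3 / 3).
    "T3": lambda: rng.multivariate_normal(np.zeros(d), Sigma, N)
                  / np.sqrt(rng.chisquare(3, (N, 1)) / 3),
    "EXP": lambda: rng.exponential(scale=0.5, size=(N, d)),
}

def simulate(name):
    """Draw covariates from one of the six distributions and logistic responses."""
    X = generators[name]()
    p = 1.0 / (1.0 + np.exp(-X @ beta_t))  # P(y = 1 | x) under the logistic model
    return X, rng.binomial(1, p)

X, y = simulate("mzNormal")
```
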
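Algorithm 1's two-step, subsampling-with-replacement structure might look roughly like the following. The π_i formula is an L-optimal-style choice of the kind the quotes mention, and the second step shown is a plain inverse-probability-weighted fit; the paper's more efficient estimators (β̂_uw, β̂_p) modify that step, so this is a sketch of the general scheme, not the paper's exact algorithm:

```python
import numpy as np

def fit_logistic(X, y, w=None, iters=30):
    """Minimal (weighted) logistic regression via Newton's method, no intercept."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        g = X.T @ (w * (y - p))                        # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X     # weighted information
        beta += np.linalg.solve(H + 1e-10 * np.eye(d), g)  # tiny ridge for safety
    return beta

def two_step_subsample_estimate(X, y, n0=200, n=1000, rng=np.random.default_rng(0)):
    N = len(y)
    # Step 1: uniform pilot subsample of size n0 -> pilot estimate.
    idx0 = rng.choice(N, n0, replace=True)
    beta0 = fit_logistic(X[idx0], y[idx0])
    # L-optimal-style probabilities: pi_i proportional to |y_i - p_i| * ||x_i||.
    p_full = 1.0 / (1.0 + np.exp(-X @ beta0))
    pi = np.abs(y - p_full) * np.linalg.norm(X, axis=1)
    pi /= pi.sum()
    # Step 2: subsample with replacement using pi; weighted (1 / (N * pi)) fit.
    idx = rng.choice(N, n, replace=True, p=pi)
    return fit_logistic(X[idx], y[idx], w=1.0 / (N * pi[idx]))

# Demo on simple synthetic data (setup assumed, not the paper's exact design).
rng = np.random.default_rng(1)
N, d = 10_000, 7
X = rng.standard_normal((N, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.full(d, 0.5))))
beta_hat = two_step_subsample_estimate(X, y)
```
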
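The relative efficiencies reported in Figures 1 and 4 are ratios of empirical MSEs over repeated subsampling. A minimal sketch of that bookkeeping, with hypothetical stand-in estimators in place of the real ones:

```python
import numpy as np

def empirical_mse(estimator, X, y, beta_t, reps=500):
    """Average squared error of a randomized subsample estimator over reps runs."""
    errs = [np.sum((estimator(X, y) - beta_t) ** 2) for _ in range(reps)]
    return float(np.mean(errs))

# Relative efficiency of estimator b to a baseline b_w is MSE(b_w) / MSE(b):
# values above 1 mean b is the more efficient one.
rng = np.random.default_rng(0)
beta_t = np.full(7, 0.5)
# Hypothetical stand-ins: the baseline has twice the noise of the improved one.
improved = lambda X, y: beta_t + 0.1 * rng.standard_normal(7)
baseline = lambda X, y: beta_t + 0.2 * rng.standard_normal(7)
rel_eff = (empirical_mse(baseline, None, None, beta_t)
           / empirical_mse(improved, None, None, beta_t))
```

With these stand-ins the ratio is close to 4, since doubling the noise quadruples the MSE.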
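Algorithm 2's Poisson subsampling admits a single-pass implementation, which is presumably what makes the N = 10^7 runs on the smaller 8GB machine feasible. A minimal sketch, using the standard min(1, nπ_i) inclusion rule (assumed here, not quoted from the paper):

```python
import numpy as np

def poisson_subsample(X, y, pi, n, rng=np.random.default_rng(0)):
    """Single-pass Poisson subsampling: keep point i independently w.p. min(1, n*pi_i).

    Unlike sampling with replacement, no joint draw of n indices from all N
    probabilities is needed, so the data can be scanned once and discarded.
    """
    keep = np.minimum(1.0, n * pi)
    mask = rng.random(len(y)) < keep
    return X[mask], y[mask], 1.0 / keep[mask]  # inverse-probability weights

# Demo with uniform probabilities: the subsample size is random, about n on average.
N, n = 100_000, 1000
rng = np.random.default_rng(0)
X = rng.standard_normal((N, 3))
y = rng.binomial(1, 0.5, N)
Xs, ys, w = poisson_subsample(X, y, np.full(N, 1.0 / N), n)
```
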