More Efficient Estimation for Logistic Regression with Optimal Subsamples

Authors: HaiYing Wang

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of the more efficient estimators in terms of both estimation efficiency and computational efficiency in this section. For simulation, to compare with the original OSMAC estimator, we use exactly the same setup used in Section 5.1 of Wang et al. (2018). Specifically, the full data sample size N = 10,000 and the true value of β, β_t, is a 7 × 1 vector of 0.5. The following 6 distributions of x are considered: multivariate normal distribution with mean zero (mzNormal), multivariate normal distribution with nonzero mean (nzNormal), multivariate normal distribution with mean zero and unequal variances (ueNormal), mixture of two multivariate normal distributions with different means (mixNormal), multivariate t distribution with 3 degrees of freedom (T3), and exponential distribution (EXP). [...] Figure 1 presents the relative efficiency of β̂_uw and β̂_p based on two different choices of π_i^OS: π_i^Aopt and π_i^Lopt. It is seen that in general β̂_uw and β̂_p are more efficient than β̂_w. [...] We also calculate the empirical unconditional MSE by generating the full data in each repetition of the simulation. The results are similar and thus are omitted. To evaluate the performance of the proposed method with different choices of the subsampling probabilities for subsampling with replacement and Poisson subsampling, Figure 2 plots empirical MSEs of using π^Aopt, π^Lopt, π^lcc (local case-control), and the uniform subsampling probability.
Researcher Affiliation | Academia | HaiYing Wang, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
Pseudocode | Yes | Algorithm 1: More efficient estimation based on subsampling with replacement [...] Algorithm 2: More efficient estimation based on Poisson subsampling
Open Source Code | No | The paper does not contain any statement about releasing the code for the methodology described, nor any specific links to a code repository.
Open Datasets | Yes | We also apply the more efficient estimation methods to a supersymmetric (SUSY) benchmark data set (Baldi et al., 2014) available from the Machine Learning Repository (Dua and Karra Taniskidou, 2017).
Dataset Splits | Yes | We fix the first-step sample size n0 = 200 and choose n to be 100, 200, 400, 600, 800, and 1000. This is the same setup used in Wang et al. (2018). [...] We use the more efficient estimation methods with subsample size n to estimate parameters in logistic regression. Figure 4 gives the relative efficiency of β̂_uw and β̂_p to β̂_w for both π_i^Lopt and π_i^Aopt.
Hardware Specification | Yes | All methods are implemented in the R programming language (R Core Team, 2017), and computations are carried out on a desktop running Ubuntu Linux 16.04 with an Intel i7 processor and 16GB RAM. Only one logical CPU is used for the calculation. [...] We also use a smaller computer with 8GB RAM to implement the method.
Software Dependencies | Yes | All methods are implemented in the R programming language (R Core Team, 2017)
Experiment Setup | Yes | The full data sample size N = 10,000 and the true value of β, β_t, is a 7 × 1 vector of 0.5. [...] We fix the first-step sample size n0 = 200 and choose n to be 100, 200, 400, 600, 800, and 1000. This is the same setup used in Wang et al. (2018). [...] We set the value of d to d = 50, the values of N to be N = 10^4, 10^5, 10^6, and 10^7, and the subsample sizes to be n0 = 200 and n = 1000.
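The simulation setup quoted under Research Type above can be sketched in code. N, d, and the true β come from the paper; the covariance matrix, the nonzero mean, the unequal-variance scaling, the mixture components, and the exponential rate below are illustrative assumptions, not values taken from the text, and Python stands in for the paper's R implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 7
beta_t = np.full(d, 0.5)  # true beta: a 7 x 1 vector of 0.5

# Assumed AR(1)-type covariance for illustration only.
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
scale = np.sqrt(np.outer(np.arange(1, d + 1), np.arange(1, d + 1)))

generators = {
    "mzNormal": lambda: rng.multivariate_normal(np.zeros(d), Sigma, N),
    "nzNormal": lambda: rng.multivariate_normal(np.full(d, 1.5), Sigma, N),
    "ueNormal": lambda: rng.multivariate_normal(np.zeros(d), Sigma * scale, N),
    "mixNormal": lambda: np.where(
        rng.random((N, 1)) < 0.5,
        rng.multivariate_normal(np.ones(d), Sigma, N),
        rng.multivariate_normal(-np.ones(d), Sigma, N),
    ),
    # Multivariate t with 3 df: normal draws divided by sqrt(chi2_3 / 3).
    "T3": lambda: rng.multivariate_normal(np.zeros(d), Sigma, N)
                  / np.sqrt(rng.chisquare(3, (N, 1)) / 3),
    "EXP": lambda: rng.exponential(scale=0.5, size=(N, d)),
}

def simulate(name):
    """Draw covariates from one of the six distributions and logistic responses."""
    X = generators[name]()
    p = 1.0 / (1.0 + np.exp(-X @ beta_t))  # P(y = 1 | x) under the logistic model
    return X, rng.binomial(1, p)

X, y = simulate("mzNormal")
```
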
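Algorithm 1's two-step, subsampling-with-replacement structure might look roughly like the following. The π_i formula is an L-optimal-style choice of the kind the quotes mention, and the second step shown is a plain inverse-probability-weighted fit; the paper's more efficient estimators (β̂_uw, β̂_p) modify that step, so this is a sketch of the general scheme, not the paper's exact algorithm:

```python
import numpy as np

def fit_logistic(X, y, w=None, iters=30):
    """Minimal (weighted) logistic regression via Newton's method, no intercept."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        g = X.T @ (w * (y - p))                        # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X     # weighted information
        beta += np.linalg.solve(H + 1e-10 * np.eye(d), g)  # tiny ridge for safety
    return beta

def two_step_subsample_estimate(X, y, n0=200, n=1000, rng=np.random.default_rng(0)):
    N = len(y)
    # Step 1: uniform pilot subsample of size n0 -> pilot estimate.
    idx0 = rng.choice(N, n0, replace=True)
    beta0 = fit_logistic(X[idx0], y[idx0])
    # L-optimal-style probabilities: pi_i proportional to |y_i - p_i| * ||x_i||.
    p_full = 1.0 / (1.0 + np.exp(-X @ beta0))
    pi = np.abs(y - p_full) * np.linalg.norm(X, axis=1)
    pi /= pi.sum()
    # Step 2: subsample with replacement using pi; weighted (1 / (N * pi)) fit.
    idx = rng.choice(N, n, replace=True, p=pi)
    return fit_logistic(X[idx], y[idx], w=1.0 / (N * pi[idx]))

# Demo on simple synthetic data (setup assumed, not the paper's exact design).
rng = np.random.default_rng(1)
N, d = 10_000, 7
X = rng.standard_normal((N, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.full(d, 0.5))))
beta_hat = two_step_subsample_estimate(X, y)
```
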
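The relative efficiencies reported in Figures 1 and 4 are ratios of empirical MSEs over repeated subsampling. A minimal sketch of that bookkeeping, with hypothetical stand-in estimators in place of the real ones:

```python
import numpy as np

def empirical_mse(estimator, X, y, beta_t, reps=500):
    """Average squared error of a randomized subsample estimator over reps runs."""
    errs = [np.sum((estimator(X, y) - beta_t) ** 2) for _ in range(reps)]
    return float(np.mean(errs))

# Relative efficiency of estimator b to a baseline b_w is MSE(b_w) / MSE(b):
# values above 1 mean b is the more efficient one.
rng = np.random.default_rng(0)
beta_t = np.full(7, 0.5)
# Hypothetical stand-ins: the baseline has twice the noise of the improved one.
improved = lambda X, y: beta_t + 0.1 * rng.standard_normal(7)
baseline = lambda X, y: beta_t + 0.2 * rng.standard_normal(7)
rel_eff = (empirical_mse(baseline, None, None, beta_t)
           / empirical_mse(improved, None, None, beta_t))
```

With these stand-ins the ratio is close to 4, since doubling the noise quadruples the MSE.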
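Algorithm 2's Poisson subsampling admits a single-pass implementation, which is presumably what makes the N = 10^7 runs on the smaller 8GB machine feasible. A minimal sketch, using the standard min(1, nπ_i) inclusion rule (assumed here, not quoted from the paper):

```python
import numpy as np

def poisson_subsample(X, y, pi, n, rng=np.random.default_rng(0)):
    """Single-pass Poisson subsampling: keep point i independently w.p. min(1, n*pi_i).

    Unlike sampling with replacement, no joint draw of n indices from all N
    probabilities is needed, so the data can be scanned once and discarded.
    """
    keep = np.minimum(1.0, n * pi)
    mask = rng.random(len(y)) < keep
    return X[mask], y[mask], 1.0 / keep[mask]  # inverse-probability weights

# Demo with uniform probabilities: the subsample size is random, about n on average.
N, n = 100_000, 1000
rng = np.random.default_rng(0)
X = rng.standard_normal((N, 3))
y = rng.binomial(1, 0.5, N)
Xs, ys, w = poisson_subsample(X, y, np.full(N, 1.0 / N), n)
```
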