Maximum sampled conditional likelihood for informative subsampling

Authors: HaiYing Wang, Jae Kwang Kim

JMLR 2022

Reproducibility assessment — each entry lists the variable, the result, and the supporting LLM response.
Research Type: Experimental — "Numerical experiments are provided to evaluate the practical performance of the proposed method."
Researcher Affiliation: Academia — HaiYing Wang, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA; Jae Kwang Kim, Department of Statistics, Iowa State University, Ames, IA 50011, USA.
Pseudocode: No — The paper discusses several algorithms, such as the Fisher-scoring algorithm and Newton's method, but does not present any of them in a structured pseudocode or algorithm block; the steps are described within the regular text.
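For context on the algorithms named above: for canonical-link models such as logistic regression, Fisher scoring coincides with Newton's method, which repeatedly solves the score equation via the update θ ← θ + H⁻¹·score. Below is a minimal, hypothetical sketch for a binary logistic model; it is not the paper's implementation (which is in Julia), and the function name and setup are illustrative assumptions.

```python
import numpy as np

def logistic_mle_newton(X, y, max_iter=25, tol=1e-8):
    """Fit a binary logistic regression MLE by Newton's method.

    For the canonical logit link this is identical to Fisher scoring:
    the observed and expected information matrices coincide.
    X is an (n, d) design matrix (first column of ones for the
    intercept), y an (n,) vector of 0/1 responses.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        grad = X.T @ (y - p)                  # score vector
        w = p * (1.0 - p)                     # Bernoulli variances
        H = X.T @ (X * w[:, None])            # Fisher information
        step = np.linalg.solve(H, grad)       # Newton direction
        beta += step
        if np.linalg.norm(step) < tol:        # converged
            break
    return beta
```

Newton's method typically converges in a handful of iterations here because the logistic log-likelihood is concave.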
Open Source Code: No — The paper states: "We implemented all the algorithm in Julia (Bezanson et al., 2017) on a Desktop running Ubuntu 20.04." It mentions using Flux.jl but provides no explicit statement of, or link to, the source code for the methodology described in the paper.
Open Datasets: Yes — Real data example: the cover type data (Blackard and Dean, 1999). "This dataset contains N = 581,012 observations on ten quantitative variables... To demonstrate the performance of the MSCLE, we applied it to the cover type data (Blackard and Dean, 1999)... In this section, we illustrate the advantage of the MSCLE over the IPW estimator using the famous MNIST data that is available at http://yann.lecun.com/exdb/mnist/."
Dataset Splits: Yes — "We set the full data sample size N = 10^6, and let the subsample sizes be n = 500, 1000, 1500, and 2000... We used a smaller number of iterations here because the variations of the computational costs across different repetitions are much smaller than that of the estimators... The data has a training set with 60,000 instances and a testing set with 10,000 instances... We use a subsample of average size n = 5,000 out of N = 60,000 (about 8.3% of the training data) to train the model. The pilot probabilities p(x_i, θ_plt)'s are obtained from a pilot model trained with 2,000 uniformly selected instances from the training set."
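The pilot-then-subsample workflow quoted above (a uniform pilot fit, then informative inclusion probabilities over the full training set) can be sketched with Poisson subsampling. This is an illustrative sketch only: the score definition, sizes, and function names are assumptions, not the paper's actual sampling scheme.

```python
import numpy as np

def poisson_subsample(scores, n_target, rng):
    """Poisson subsampling: include unit i independently with probability
    pi_i proportional to scores[i], scaled so the expected subsample size
    is n_target (probabilities capped at 1)."""
    pi = np.minimum(1.0, n_target * scores / scores.sum())
    keep = rng.random(scores.size) < pi
    return keep, pi

# Illustration mirroring the quoted sizes: N = 60,000 training instances
# and an expected subsample size of n = 5,000. The pilot scores here are
# made-up stand-ins; in practice they would come from a pilot model fit
# on 2,000 uniformly selected instances.
rng = np.random.default_rng(1)
N, n = 60_000, 5_000
pilot_scores = rng.uniform(0.5, 1.0, size=N)  # hypothetical pilot-based scores
keep, pi = poisson_subsample(pilot_scores, n, rng)
```

With Poisson sampling the realized subsample size is random, which is why the quoted text speaks of an "average size" n = 5,000.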
Hardware Specification: Yes — "We implemented all the algorithm in Julia (Bezanson et al., 2017) on a Desktop running Ubuntu 20.04. We restricted all the calculations to use one thread of the CPU with a base frequency of 2,200 megahertz and a maximum boosted frequency of 4,549 megahertz."
Software Dependencies: No — "We implemented all the algorithm in Julia (Bezanson et al., 2017) on a Desktop running Ubuntu 20.04... We implement the convolutional neural network LeNet-5 (LeCun et al., 1998) with Flux.jl (Innes, 2018)." While Julia and Flux.jl are mentioned, specific version numbers for these software components are not provided.
Experiment Setup: Yes — "We set the full data sample size N = 10^6, and let the subsample sizes be n = 500, 1000, 1500, and 2000. We assume that the responses have three possible categories (K = 3), and let the dimension of the covariates x_i = (1, x_{1,i}^T)^T be d = 4, where the first element of one is for the intercept parameters... We repeat the simulation for R = 1000 times and calculate the empirical mean squared error (MSE)... We used a smaller number of iterations here because the variations of the computational costs across different repetitions are much smaller than that of the estimators. Results for case (a) with multivariate normal covariates are reported in Table 1."
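The empirical MSE over R = 1000 repetitions described above is simply the average squared estimation error across repetitions. A minimal sketch, with a placeholder estimator matrix (the exact error metric in the paper may differ):

```python
import numpy as np

def empirical_mse(estimates, theta_true):
    """Empirical MSE over R repetitions: the mean squared Euclidean
    distance between each repetition's estimate and the true parameter.

    `estimates` is an (R, d) array of per-repetition estimates,
    `theta_true` a length-d vector of true parameter values.
    """
    diffs = np.asarray(estimates) - np.asarray(theta_true)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

Averaging over many repetitions reduces Monte Carlo noise in the reported MSE, which is why a large R is used for estimator comparisons while a smaller number of iterations suffices for timing.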