reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Distributed Estimation on Semi-Supervised Generalized Linear Model

Authors: Jiyuan Tu, Weidong Liu, Xiaojun Mao

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, several simulation analyses and real data studies are provided to demonstrate the effectiveness of our method.
Researcher Affiliation	Academia	Jiyuan Tu EMAIL School of Mathematics, Shanghai University of Finance and Economics, Shanghai, 200433, China; Weidong Liu EMAIL School of Mathematical Sciences, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China; Xiaojun Mao EMAIL School of Mathematical Sciences, Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai, 200240, China
Pseudocode	Yes	Algorithm 1 Semi-Supervised Distributed Approximate NEwton Method (SSDANE) ... Algorithm 2 Semi-Supervised Distributed Approximate NEwton with Average (SSDANE-Avg)
Open Source Code	No	The paper does not provide explicit statements about releasing code for their methodology, nor does it include a link to a code repository or mention code in supplementary materials.
Open Datasets	Yes	In this section, we analyze the CelebA dataset² from the Kaggle website, which is included in LEAF (Caldas et al., 2018), a standard distributed learning benchmark. ... ² https://www.kaggle.com/datasets/jessicali9530/celeba-dataset
Dataset Splits	Yes	We take the total sample size as 120000, and randomly partition the dataset into 20000 testing data, 20000 labeled training data, and 80000 unlabeled training data.
Hardware Specification	No	The paper does not explicitly describe the hardware used for running its experiments. It only mentions general 'computing units' without specific models or specifications.
Software Dependencies	No	The paper does not list any specific software dependencies with version numbers used for the experiments.
Experiment Setup	Yes	Parameter Settings: In both models, we assume the i.i.d. covariate vectors X_i = (X_{i,1}, ..., X_{i,p})^T are drawn from a multivariate normal distribution N(0, Σ) for i = 1, ..., N. Here the covariance matrix Σ is a p × p Toeplitz matrix with its (i, j)-th entry Σ_{ij} = 0.5^{|i − j|}, where 1 ≤ i, j ≤ p. We fix dimension p = 20 and the true coefficient β = (1, 0.95, 0.9, ..., 0.1, 0.05). We repeat 100 independent simulations and report the averaged estimation error and the corresponding standard error. ... Effect of the Number of Machines and Local Unlabeled Data: To investigate the effect of the number of machines and local unlabeled data, we fix the labeled local sample size n to be 100, and vary the number of machines m from {20, 50, 100}, and the unlabeled local sample size n from {100, 200, 500}. ... For the choice of the initial estimator, we uniformly use the local estimator on the master machine H_1. ... To solve the optimization problem (13) for the logistic regression model, we apply conjugate gradient descent motivated by Minka (2003). ... We consider three cases where (m, n) = (20, 1000), (40, 500) and (80, 250) respectively.
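
Since no code is released for the paper, the simulation design quoted in the Experiment Setup row can be approximated as follows. This is a minimal sketch, not the authors' implementation: the function names, the seeds, and the use of NumPy are assumptions, and the logistic label generator is an illustrative guess at how responses could be drawn for the logistic regression model mentioned in the excerpt.

```python
import numpy as np


def toeplitz_cov(p=20, rho=0.5):
    """Return the p x p Toeplitz matrix with (i, j)-th entry rho^|i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def true_beta(p=20):
    """Coefficient vector (1, 0.95, ..., 0.05): p values stepping down by 0.05 when p = 20."""
    return np.linspace(1.0, 0.05, p)


def simulate_covariates(n_total, p=20, rho=0.5, seed=0):
    """Draw n_total i.i.d. rows from N(0, Sigma) with the Toeplitz Sigma above."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical choice
    return rng.multivariate_normal(np.zeros(p), toeplitz_cov(p, rho), size=n_total)


def simulate_logistic_labels(X, beta, seed=0):
    """Assumed response model: y ~ Bernoulli(sigmoid(X @ beta))."""
    rng = np.random.default_rng(seed)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return rng.binomial(1, prob)
```

For example, with m = 20 machines and n = 100 labeled samples per machine, `simulate_covariates(20 * 100)` followed by `simulate_logistic_labels(X, true_beta())` would yield one labeled replicate; the excerpt reports averages over 100 such independent replicates.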