Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis

Authors: Yuanxing Chen, Qingzhao Zhang, Shuangge Ma, Kuangnan Fang

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Numerical studies, including simulation in Section 4 and data analysis in Section 5, demonstrate the practical utilization and superiority of the proposed approach. ... We conduct abundant simulations to gauge the performance of the proposed approach. ... In this section, we apply the proposed method ... to a bank website logs data, which is stored in multiple interfaces (clients)."
Researcher Affiliation: Academia. Yuanxing Chen, Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen 361005, China; Qingzhao Zhang, Department of Statistics and Data Science, School of Economics, and The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China; Shuangge Ma, Department of Biostatistics, Yale University, New Haven, CT 06520, USA; Kuangnan Fang, Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen 361005, China.
Pseudocode: Yes. The proposed proximal ADMM algorithm is summarized as follows.
Step 1. Obtain the initial estimates (θ^0, ξ^0).
Step 2. At iteration t, t = 1, 2, ..., update θ^t as follows.
Step 2.1. Initialize u^t_{1,0} = θ^t_{1,0} = θ^{t-1} and ρ_0 = 1.
Step 2.2. At iteration s, s = 1, 2, ..., compute ...
Step 2.3. Repeat Step 2.2 until convergence, and set θ^t = θ^t_{1,s}.
Step 3. For 1 ≤ k < k' ≤ K, update ω^t_{kk'} = p'_τ(||θ^{(k),t} − θ^{(k'),t}||_2, λ_2).
Step 4. Update ξ^t = prox_{ν h_1^*}(ν θ^t A + ξ^{t−1}).
Step 5. Repeat Steps 2–4 until convergence, and set α^t = prox_{ν^{−1} h_1}(θ^t A + ν^{−1} ξ^t).
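The dual-style update in Step 4 and the recovery in Step 5 are two sides of the Moreau decomposition, prox_{νh*}(x) = x − ν·prox_{ν⁻¹h}(x/ν). A minimal numerical sketch of that identity, assuming for illustration only that h_1 is the ℓ1 norm (the paper's h_1 may differ); `prox_l1` and `prox_l1_conjugate` are hypothetical helper names:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l1_conjugate(x, nu):
    """prox of nu*h1^* computed via Moreau's identity:
    prox_{nu h*}(x) = x - nu * prox_{(1/nu) h}(x / nu)."""
    return x - nu * prox_l1(x / nu, 1.0 / nu)

nu = 0.7
x = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
# For h = ||.||_1, h* is the indicator of the l-inf unit ball, so its prox
# is simply the projection onto [-1, 1] -- a known closed form to check against.
direct = np.clip(x, -1.0, 1.0)
via_moreau = prox_l1_conjugate(x, nu)
print(np.allclose(direct, via_moreau))  # True
```

The check confirms that the conjugate prox in Step 4 can always be evaluated through the primal prox appearing in Step 5, which is why the algorithm never needs h_1^* explicitly.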
Open Source Code: No. The paper does not contain any explicit statement about releasing source code or a link to a code repository.
Open Datasets: No. Section 4 describes a 'Simulation Study' in which data are generated for the experiments rather than drawn from publicly available datasets. Section 5 describes a 'Data Application' on 'a bank website logs data', which is not stated to be publicly available, and no link or citation is provided for its access.
Dataset Splits: Yes. "Specifically, we randomly select 4/5 of the samples and form the training data. In this selection, the normal:abnormal ratio is retained. The remaining samples form the testing data."
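The split described above is a stratified 4/5 train/test split: sampling is done within each class so the normal:abnormal ratio is preserved. A generic sketch of such a split (not the paper's code; `stratified_split` and the label values are illustrative):

```python
import random

def stratified_split(samples, labels, train_frac=0.8, seed=0):
    """Split into train/test, preserving each label's proportion."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)                      # randomize within each class
        cut = int(round(train_frac * len(group)))
        train += [(s, y) for s in group[:cut]]  # 4/5 of this class
        test += [(s, y) for s in group[cut:]]   # remaining 1/5
    return train, test

# Example: 100 "normal" and 20 "abnormal" records keep their 5:1 ratio
samples = list(range(120))
labels = ["normal"] * 100 + ["abnormal"] * 20
train, test = stratified_split(samples, labels)
print(len(train), len(test))  # 96 24
```

Splitting per class, rather than over the pooled sample, is what guarantees the retained ratio mentioned in the quote.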
Hardware Specification: No. "For example, the analysis of one simulated data set under Example 1 with K = 32, p = 100, and 25 candidate tuning parameter values takes about 3 minutes using a desktop with standard configurations (here we note that penalized fusion estimation is in general computationally more expensive)." The phrase "desktop with standard configurations" is too vague and does not provide specific hardware details.
Software Dependencies: No. "For the SK estimator, we adopt two criteria, namely the Hartigan statistic (Hartigan, 1975) and the gap statistic (Tibshirani et al., 2001), to choose the number of clusters; this is realized using the R package sparcl. The corresponding two variants are referred to as SK(har) and SK(gap), respectively. For the CFL estimator, we separately analyze one-shot CFL (OCFL) and iterative CFL with multiple rounds (ICFL), where the number of clusters is specified as the true value for them. Here, both ICFL and OCFL correspond to Algorithm 2 of Ghosh et al. (2020), but the former sets the number of communication rounds as R = 100, while the latter sets R = 1." No version numbers are provided for the mentioned R packages, for the R language itself, or for the 'Skip-gram model (which is a popular model of word2vec)' mentioned in Section 5.
Experiment Setup: Yes. "Tuning parameter selection. Following the literature, we set ν = 1 and the concavity-related parameter τ = 3. Following Yang et al. (2019), we select λ_1 and λ_2 by minimizing the modified BIC defined as

mBIC(λ_1, λ_2) = (1/N) Σ_{k=1}^{K} { [θ̂^{(k)}(λ_1, λ_2)]^⊤ Ṽ^{(k)} [θ̂^{(k)}(λ_1, λ_2)] − 2 [θ̂^{(k)}(λ_1, λ_2)]^⊤ ζ̃^{(k)} } + C_N q̂(λ_1, λ_2),

where q̂(λ_1, λ_2) is the number of nonzero distinct coefficient vectors, and C_N is a positive constant depending on N. Following Ma and Huang (2017), we adopt C_N = log(log(Kp)), which can automatically adapt to a diverging number of parameters."
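The penalty side of the mBIC is easy to compute: C_N = log(log(Kp)), and q̂ counts the distinct nonzero coefficient vectors across the K clients, i.e. the number of fitted clusters. A small sketch of these two ingredients (illustrative only; the quadratic fit term requires the paper's Ṽ^{(k)} and ζ̃^{(k)} quantities, and `mbic_penalty` is a hypothetical name):

```python
import math
import numpy as np

def mbic_penalty(theta_hat, decimals=6):
    """Return (C_N, q_hat) for the mBIC penalty term C_N * q_hat.

    theta_hat: (K, p) array of per-client coefficient estimates.
    q_hat: number of distinct nonzero coefficient vectors (rounding
    merges vectors that agree up to numerical noise)."""
    K, p = theta_hat.shape
    nonzero_rows = {tuple(np.round(row, decimals))
                    for row in theta_hat
                    if np.any(np.abs(row) > 10.0 ** (-decimals))}
    return math.log(math.log(K * p)), len(nonzero_rows)

# K = 4 clients with p = 3 coefficients, falling into two clusters
theta = np.array([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 2.0, 0.0]])
C_N, q_hat = mbic_penalty(theta)
print(q_hat)  # 2
```

Because C_N = log(log(Kp)) grows with the total parameter count Kp, the penalty automatically strengthens as the problem dimension diverges, which is the adaptivity the quote attributes to Ma and Huang (2017).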