reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Distributed Algorithms for U-statistics-based Empirical Risk Minimization

Authors: Lanjue Chen, Alan T.K. Wan, Shuyi Zhang, Yong Zhou

JMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	7. Simulation Studies The purpose of this section is to examine the finite sample performance of the proposed methods via a large scale simulation study. We focus on the U-ERM problems of pairwise ranking and smoothed rank-based estimation under the accelerated failure time model, both being introduced in Section 3.2. For comparison purposes, we also consider the gold-standard method that uses the full set of data all in one go to obtain the global U-estimates, and the naive method that averages local U-estimates obtained across different subsets of data.
Researcher Affiliation	Academia	Lanjue Chen EMAIL Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Academy of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China Alan T.K. Wan EMAIL Department of Management Sciences, School of Data Science and Department of Biostatistics City University of Hong Kong, Kowloon, Hong Kong Shuyi Zhang EMAIL Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Academy of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China Yong Zhou EMAIL Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Academy of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China
Pseudocode	Yes	Algorithm 1: Distributed iterative algorithm based on surrogate empirical risk (DiaSER) and Algorithm 2: Distributed iterative algorithm based on one-step estimation (DiaOSE)
Open Source Code	No	No specific statement or link for open-source code was found. The license information provided relates to the paper itself, not the implementation code.
Open Datasets	No	The paper uses simulated data for its experiments, as described in sections 7.1.1 and 7.2.1: 'We set θ = (1/ 5) and generate Xi from a multivariate N(0, Σ) distribution...' and 'We generate Xi from N(0, Σ) and the random errors ζi from the standard extreme value distribution...'. It does not provide access information for publicly available datasets.
Dataset Splits	Yes	The data D_N is evenly divided into K smaller subsets D_k, k = 1, ..., K, each of size n. In simulation studies, the sample size of each subset n is fixed (e.g., n = 50) and K, the number of machines, is varied. For example, 'we fix the sample size of each subset to be n = 50 and vary K, the number of machines, from 25 to 200.' Also, 'for K = 50, 100, 150, 200, and 250, while fixing the total sample size to N = 60000. These values of K and N result in n = 120, 60, 40, 30, and 24.'
Hardware Specification	No	The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running its experiments. It only discusses theoretical computational costs.
Software Dependencies	No	The paper mentions implementing the BFGS method via 'the R package optim with J = 1000 iterations' but does not provide specific version numbers for R or the optim package itself.
Experiment Setup	Yes	The paper specifies parameters for data generation, such as 'We set θ = (1/ 5) and generate Xi from a multivariate N(0, Σ) distribution, where Σij = 1 when i = j and Σij = 0.5 when i ≠ j, and ε ~ N(0, 1).' It also defines experimental parameters like 'fix the sample size of each subset to be n = 50 and vary K, the number of machines, from 25 to 200. We set the number of replications of the experiment to S = 200.' Additionally, it states, 'We implement the BFGS method via the R package optim with J = 1000 iterations.'
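
The data-generation and splitting setup quoted in the table can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' code (the paper reports using R). The function names, the covariate dimension p = 5, and the random shuffling before the split are assumptions made here for illustration; only the covariance structure (1 on the diagonal, 0.5 elsewhere), the standard normal errors, and the even split into K subsets of size n come from the quoted text.

```python
import numpy as np


def equicorrelated_cov(p, rho=0.5):
    """Covariance matrix with 1 on the diagonal and rho off the diagonal."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)


def simulate_and_split(n=50, K=25, p=5, seed=0):
    """Draw N = n * K observations and split their indices into K equal subsets.

    p (covariate dimension) and seed are hypothetical choices for this sketch.
    """
    rng = np.random.default_rng(seed)
    N = n * K
    # Covariates X_i ~ N(0, Sigma) and errors eps_i ~ N(0, 1).
    X = rng.multivariate_normal(np.zeros(p), equicorrelated_cov(p), size=N)
    eps = rng.standard_normal(N)
    # Even split: shuffle the row indices, then cut them into K blocks of size n.
    # The shuffle is an assumption; the source only says the split is even.
    idx = rng.permutation(N)
    subsets = np.array_split(idx, K)
    return X, eps, subsets


if __name__ == "__main__":
    X, eps, subsets = simulate_and_split(n=50, K=25)
    print(X.shape, len(subsets), len(subsets[0]))
```

Each entry of `subsets` holds the row indices assigned to one machine, so `X[subsets[k]]` is the local data for machine k. Varying K while holding n = 50 fixed mirrors the first design described in the table.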