Distributed Algorithms for U-statistics-based Empirical Risk Minimization
Authors: Lanjue Chen, Alan T.K. Wan, Shuyi Zhang, Yong Zhou
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 7, "Simulation Studies": 'The purpose of this section is to examine the finite sample performance of the proposed methods via a large scale simulation study. We focus on the U-ERM problems of pairwise ranking and smoothed rank-based estimation under the accelerated failure time model, both being introduced in Section 3.2. For comparison purposes, we also consider the gold-standard method that uses the full set of data all in one go to obtain the global U-estimates, and the naive method that averages local U-estimates obtained across different subsets of data.' |
| Researcher Affiliation | Academia | Lanjue Chen: Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Academy of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China; Alan T.K. Wan: Department of Management Sciences, School of Data Science, and Department of Biostatistics, City University of Hong Kong, Kowloon, Hong Kong; Shuyi Zhang: Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Academy of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China; Yong Zhou: Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Academy of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China |
| Pseudocode | Yes | Algorithm 1: Distributed iterative algorithm based on surrogate empirical risk (DiaSER) and Algorithm 2: Distributed iterative algorithm based on one-step estimation (DiaOSE) |
| Open Source Code | No | No specific statement or link for open-source code was found. The license information provided relates to the paper itself, not the implementation code. |
| Open Datasets | No | The paper uses simulated data for its experiments, as described in sections 7.1.1 and 7.2.1: 'We set θ = (1/ 5) and generate Xi from a multivariate N(0, Σ) distribution...' and 'We generate Xi from N(0, Σ) and the random errors ζi from the standard extreme value distribution...'. It does not provide access information for publicly available datasets. |
| Dataset Splits | Yes | The data D_N is evenly divided into K smaller subsets D_k, k = 1, ..., K, each of size n. In the simulation studies, either the subset size n is fixed (e.g., n = 50) while K, the number of machines, is varied, or the total sample size N is fixed while K is varied. For example, 'we fix the sample size of each subset to be n = 50 and vary K, the number of machines, from 25 to 200.' Also, 'for K = 50, 100, 150, 200, and 250, while fixing the total sample size to N = 60000. These values of K and N result in n = 120, 60, 40, 30, and 24.' |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running its experiments. It only discusses theoretical computational costs. |
| Software Dependencies | No | The paper mentions implementing the BFGS method via 'the R package optim with J = 1000 iterations' but gives no version number for R. `optim` is a function in R's built-in stats package rather than a separately versioned package, so the R version is the relevant missing detail. |
| Experiment Setup | Yes | The paper specifies parameters for data generation, such as 'We set θ = (1/ 5) and generate Xi from a multivariate N(0, Σ) distribution, where Σij = 1 when i = j and Σij = 0.5 when i ≠ j, and ε ~ N(0, 1).' It also defines experimental parameters like 'fix the sample size of each subset to be n = 50 and vary K, the number of machines, from 25 to 200. We set the number of replications of the experiment to S = 200.' Additionally, it states, 'We implement the BFGS method via the R package optim with J = 1000 iterations.' |
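
Since no code is released, the following is a minimal Python sketch of the data-generation and splitting steps quoted in the table, not the authors' implementation (which the paper says uses R). It builds the covariance Σ with 1 on the diagonal and 0.5 elsewhere, draws covariates from N(0, Σ) and errors from N(0, 1), and divides N = nK observations evenly across K machines. The function names, the covariate dimension `p`, and the random seed are assumptions made for illustration; the summary above does not state the dimension.

```python
import numpy as np


def equicorrelated_cov(p, rho=0.5):
    """Covariance with 1 on the diagonal and rho off the diagonal."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)


def simulate_covariates(N, p, rho=0.5, seed=0):
    """Draw X_i ~ N(0, Sigma) and eps_i ~ N(0, 1) for i = 1, ..., N.

    p and seed are hypothetical choices, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    sigma = equicorrelated_cov(p, rho)
    X = rng.multivariate_normal(np.zeros(p), sigma, size=N)
    eps = rng.standard_normal(N)
    return X, eps


def split_evenly(N, K):
    """Return K index arrays of equal size n = N / K (one per machine)."""
    if N % K != 0:
        raise ValueError("N must be divisible by K for an even split")
    return np.split(np.arange(N), K)


# Setting quoted in the table: n = 50 per subset, K = 25 machines.
n, K = 50, 25
N = n * K
X, eps = simulate_covariates(N, p=2)   # p = 2 is an assumed dimension
subsets = split_evenly(N, K)
local_X = [X[idx] for idx in subsets]  # covariates held by each machine
```

Each entry of `local_X` plays the role of one subset D_k; a distributed estimator would compute its local quantities on these blocks, while the gold-standard comparison would use `X` as a whole.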