Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA
Authors: Zengfeng Huang, Xuemin Lin, Wenjie Zhang, Ying Zhang
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the trade-off between covariance error and communication cost on synthetic data generated from a natural statistical model. The communication cost is measured in terms of the total number of rows sent. All real numbers are stored as double-precision floats. Given an input matrix A, the accuracy of an approximation matrix B is measured by ‖AᵀA − BᵀB‖. For randomized algorithms, the reported errors and costs are the average values of 10 independent executions. For deterministic algorithms, the actual errors they incur could be much smaller than their worst-case guarantees; in our evaluations, we always report their actual errors in each experimental setting. All algorithms are implemented in MATLAB. |
| Researcher Affiliation | Academia | Zengfeng Huang EMAIL School of Data Science Fudan University Shanghai, China; Xuemin Lin EMAIL School of Computer Science and Engineering The University of New South Wales Sydney, Australia; Wenjie Zhang EMAIL School of Computer Science and Engineering The University of New South Wales Sydney, Australia; Ying Zhang EMAIL School of Computer Science University of Technology, Sydney Sydney, Australia |
| Pseudocode | Yes | Algorithm 1 SVS(A, g): A ∈ ℝ^{n×d}; g is the sampling function. 1: Set B ← empty. 2: Compute (U, Σ, V) = SVD(A). 3: Set x_j = 1 with probability g(σ_j²), and x_j = 0 with probability 1 − g(σ_j²). 4: Let w_j = σ_j / √(g(σ_j²)). 5: Append v_jᵀ rescaled by x_j·w_j (i.e., x_j·w_j·v_jᵀ) to B, where v_j is the j-th right singular vector. 6: Remove zero rows in B. 7: Output B. |
| Open Source Code | No | The paper does not provide an explicit statement or a specific link indicating that the source code for the described methodology is publicly available. While it mentions implementation in MATLAB in the numerical simulations section, this does not constitute a concrete access to source code. |
| Open Datasets | No | The paper states, 'We use the same synthetic data sets as in Liberty (2013); Ghashami et al. (2014a). We generate our data sets using the following distributions. The input matrix is A = SDU + N/ζ.' While it references prior work for the generation of synthetic data, it does not provide concrete access information (link, DOI, repository) for a publicly available or open dataset that was used in their experiments. Instead, the authors generated their own synthetic data according to a described distribution. |
| Dataset Splits | No | The paper describes generating synthetic datasets for numerical simulations but does not provide specific training, testing, or validation dataset splits. The evaluation focuses on the trade-off between covariance error and communication cost, not on model performance requiring such data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. It only mentions that 'All algorithms are implemented in MATLAB.' |
| Software Dependencies | No | The paper states that 'All algorithms are implemented in MATLAB' but does not provide a specific version number for MATLAB or any other ancillary software dependencies. |
| Experiment Setup | Yes | In the first set of experiments, we evaluate the accuracy of the four algorithms on six synthetic data sets where the noise ratio ζ varies from 4 to 12 and the signal dimension t is set to 30 and 40. For ease of comparison, the communication cost for each algorithm is tuned to be roughly 20 per machine. Figure 1 shows that the error of each algorithm increases when the number of machines grows from 20 to 160... In Figure 2, we evaluate the trade-off between covariance error and communication costs. In this set of experiments, we fix the number of machines to s = 128. |
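As a rough illustration of the SVS pseudocode quoted in the table (a NumPy sketch, not the authors' MATLAB implementation; the function name `svs` and the use of `numpy.random.default_rng` are illustrative choices), the singular-value sampling step could look like this. It assumes the sampling function g returns a strictly positive keep-probability for every squared singular value:

```python
import numpy as np

def svs(A, g, rng=None):
    """Singular-value sampling sketch, following Algorithm 1 (SVS).

    A : (n, d) input matrix.
    g : sampling function mapping a squared singular value sigma_j^2
        to a keep-probability in (0, 1].
    Returns a sketch B whose rows are rescaled right singular vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 2: SVD of A; rows of Vt are the right singular vectors v_j^T.
    _, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    probs = np.array([g(s ** 2) for s in sigma])
    # Step 3: x_j ~ Bernoulli(g(sigma_j^2)).
    keep = rng.random(len(sigma)) < probs
    # Step 4: w_j = sigma_j / sqrt(g(sigma_j^2)).
    weights = sigma / np.sqrt(probs)
    # Steps 5-6: keep the rescaled rows x_j * w_j * v_j^T, dropping zero rows.
    return weights[keep, None] * Vt[keep]
```

The rescaling by 1/√(g(σ_j²)) makes BᵀB an unbiased estimator of AᵀA; in the degenerate case g ≡ 1, every row is kept with weight σ_j, so BᵀB equals AᵀA exactly.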