Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA
Authors: Zengfeng Huang, Xuemin Lin, Wenjie Zhang, Ying Zhang
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the trade-off between covariance error and communication cost on synthetic data generated from a natural statistical model. The communication cost is measured in terms of the total number of rows sent. All real numbers are stored as double-precision floats. Given an input matrix A, the accuracy of an approximation matrix B is measured by ‖AᵀA − BᵀB‖. For randomized algorithms, the reported errors and costs are the average values of 10 independent executions. For deterministic algorithms, the actual errors they incur could be much smaller than their worst-case guarantees; in our evaluations, we always report their actual errors in each experimental setting. All algorithms are implemented in MATLAB. |
| Researcher Affiliation | Academia | Zengfeng Huang EMAIL School of Data Science Fudan University Shanghai, China; Xuemin Lin EMAIL School of Computer Science and Engineering The University of New South Wales Sydney, Australia; Wenjie Zhang EMAIL School of Computer Science and Engineering The University of New South Wales Sydney, Australia; Ying Zhang EMAIL School of Computer Science University of Technology, Sydney Sydney, Australia |
| Pseudocode | Yes | Algorithm 1 SVS(A, g): A ∈ ℝ^{n×d}; g is the sampling function. 1: Set B ← empty. 2: Compute (U, Σ, V) = SVD(A). 3: Set x_j = 1 with probability g(σ_j²), and x_j = 0 with probability 1 − g(σ_j²). 4: Let w_j = σ_j / √(g(σ_j²)). 5: Append v_jᵀ rescaled by x_j·w_j (i.e., x_j·w_j·v_jᵀ) to B, where v_j is the j-th right singular vector. 6: Remove zero rows in B. 7: Output B. |
| Open Source Code | No | The paper does not provide an explicit statement or a specific link indicating that the source code for the described methodology is publicly available. While it mentions implementation in MATLAB in the numerical simulations section, this does not constitute a concrete access to source code. |
| Open Datasets | No | The paper states, 'We use the same synthetic data sets as in Liberty (2013); Ghashami et al. (2014a). We generate our data sets using the following distributions. The input matrix is A = SDU + N/ζ.' While it references prior work for the generation of synthetic data, it does not provide concrete access information (link, DOI, repository) for a publicly available or open dataset that was used in their experiments. Instead, the authors generated their own synthetic data according to a described distribution. |
| Dataset Splits | No | The paper describes generating synthetic datasets for numerical simulations but does not provide specific training, testing, or validation dataset splits. The evaluation focuses on the trade-off between covariance error and communication cost, not on model performance requiring such data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. It only mentions that 'All algorithms are implemented in MATLAB.' |
| Software Dependencies | No | The paper states that 'All algorithms are implemented in MATLAB' but does not provide a specific version number for MATLAB or any other ancillary software dependencies. |
| Experiment Setup | Yes | In the first set of experiments, we evaluate the accuracy of the four algorithms on six synthetic data sets where the noise ratio ζ varies from 4 to 12 and the signal dimension t is set to 30 and 40. For ease of comparison, the communication cost for each algorithm is tuned to be roughly 20 per machine. Figure 1 shows that the error of each algorithm increases when the number of machines grows from 20 to 160... In Figure 2, we evaluate the trade-off between covariance error and communication costs. In this set of experiments, we fix the number of machines to s = 128. |
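As a rough illustration of the SVS pseudocode quoted in the table (a NumPy sketch, not the authors' MATLAB implementation; the function name `svs` and the use of `numpy.random.default_rng` are illustrative choices), the singular-value sampling step could look like this. It assumes the sampling function g returns a strictly positive keep-probability for every squared singular value:

```python
import numpy as np

def svs(A, g, rng=None):
    """Singular-value sampling sketch, following Algorithm 1 (SVS).

    A : (n, d) input matrix.
    g : sampling function mapping a squared singular value sigma_j^2
        to a keep-probability in (0, 1].
    Returns a sketch B whose rows are rescaled right singular vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 2: SVD of A; rows of Vt are the right singular vectors v_j^T.
    _, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    probs = np.array([g(s ** 2) for s in sigma])
    # Step 3: x_j ~ Bernoulli(g(sigma_j^2)).
    keep = rng.random(len(sigma)) < probs
    # Step 4: w_j = sigma_j / sqrt(g(sigma_j^2)).
    weights = sigma / np.sqrt(probs)
    # Steps 5-6: keep the rescaled rows x_j * w_j * v_j^T, dropping zero rows.
    return weights[keep, None] * Vt[keep]
```

The rescaling by 1/√(g(σ_j²)) makes BᵀB an unbiased estimator of AᵀA; in the degenerate case g ≡ 1, every row is kept with weight σ_j, so BᵀB equals AᵀA exactly.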