reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sparse PCA via Covariance Thresholding

Authors: Yash Deshpande, Andrea Montanari

JMLR 2016 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Figure 1 presents simulations on synthetic data under the strictly sparse model, and the Covariance Thresholding algorithm of Table 1, used in the proof of Theorem 3. The objective is to check whether the log p factor has any practical relevance or is a purely conceptual improvement. Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the Three Peak example of Johnstone and Lu (2004).
Researcher Affiliation	Academia	Yash Deshpande, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA; Andrea Montanari, Departments of Electrical Engineering and Statistics, Stanford University, Stanford, CA 94305, USA
Pseudocode	Yes	Algorithm 1 Covariance Thresholding
Open Source Code	No	No explicit statement or link to source code is provided in the paper.
Open Datasets	No	Figure 1 presents simulations on synthetic data under the strictly sparse model... Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the Three Peak example... A similar experiment with the box example of Johnstone and Lu is provided in Figure 3. The paper describes using synthetic data and established problem examples, but does not provide access information (links, DOIs, specific repositories) for any particular dataset used in the empirical evaluations.
Dataset Splits	No	For notational convenience, we shall assume that 2n sample vectors are given (instead of n): $\{x_i\}_{1 \le i \le 2n}$. We start by splitting the data into two halves: $(x_i)_{1 \le i \le n}$ and $(x_i)_{n < i \le 2n}$ and compute the respective sample covariance matrices $\hat{G}$ and $\hat{G}'$ respectively. This describes how the data is used within the algorithm, not how a dataset is split for evaluation of a model's performance on unseen data (e.g., train/test/validation splits).
Hardware Specification	No	The paper does not provide specific details about the hardware used for running the simulations or experiments.
Software Dependencies	No	The paper does not provide specific software names with version numbers used for implementation or experiments.
Experiment Setup	Yes	Choosing τ: Although in the statement of the theorem, our choice of τ depends on the SNR $\beta/\sigma^2$, it is reasonable to instead threshold "at the noise level", as follows. The noise component of entry $(i, j)$ of the sample covariance (ignoring lower order terms) is given by $\sigma^2 \langle z_i, z_j \rangle / n$. By the central limit theorem, $\langle z_i, z_j \rangle / \sqrt{n} \xrightarrow{d} N(0, 1)$. Consequently, $\sigma^2 \langle z_i, z_j \rangle / n \approx N(0, \sigma^4/n)$, and we need to choose the (rescaled) threshold proportional to $\sqrt{\sigma^4} = \sigma^2$. Using previous estimates, we let $\tau = \nu \hat{\sigma}^2$ for a constant ν. In simulations, a choice $3 \lesssim \nu \lesssim 4$ appears to work well. Parameters for Covariance Thresholding are chosen as in Section 4, with ν = 4.5.
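
The Dataset Splits and Experiment Setup rows describe halving the samples, forming two sample covariance matrices, and thresholding at the noise level. The Python sketch below illustrates that reasoning on pure-noise synthetic data; it is not the paper's implementation. All function names, the soft-thresholding rule, the choice ν = 4, and passing the true σ² in place of an estimate are assumptions made here for illustration, since the quoted text does not specify them.

```python
import numpy as np


def split_covariances(X):
    """Split the 2n rows of X into two halves and return both sample covariances."""
    n = X.shape[0] // 2
    first, second = X[:n], X[n:2 * n]
    return first.T @ first / n, second.T @ second / n


def noise_level_threshold(sigma2_hat, n, nu=4.0):
    """Entrywise threshold nu * sigma2_hat / sqrt(n), i.e. tau = nu * sigma2_hat rescaled.

    sigma2_hat is an estimate of the noise variance; how it is obtained is
    not covered by this sketch (assumption: it is supplied by the caller).
    """
    return nu * sigma2_hat / np.sqrt(n)


def soft_threshold(M, level):
    """Shrink every entry toward zero by `level` (assumed thresholding rule)."""
    return np.sign(M) * np.maximum(np.abs(M) - level, 0.0)


# Pure-noise data: every off-diagonal covariance entry is noise only.
rng = np.random.default_rng(0)
n, p, sigma2 = 2000, 50, 1.0
X = np.sqrt(sigma2) * rng.standard_normal((2 * n, p))

G, G_prime = split_covariances(X)

# Off-diagonal entries should have standard deviation close to sigma2 / sqrt(n).
off_diag_mask = ~np.eye(p, dtype=bool)
noise_std = G[off_diag_mask].std()

# Thresholding a few noise standard deviations out removes almost all of them.
level = noise_level_threshold(sigma2, n, nu=4.0)
G_thresholded = soft_threshold(G, level)
surviving_fraction = np.mean(G_thresholded[off_diag_mask] != 0.0)
```

With no signal present, `noise_std * np.sqrt(n)` should sit near σ², and nearly every off-diagonal entry of `G_thresholded` should be zero, which is the sense in which the threshold sits at the noise level.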