Streaming Heteroscedastic Probabilistic PCA with Missing Data
Authors: Kyle Gilman, David Hong, Jeffrey A. Fessler, Laura Balzano
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments demonstrate the superior subspace estimation of our method compared to state-of-the-art streaming PCA algorithms in the heteroscedastic setting. Finally, we illustrate SHASTA-PCA applied to highly heterogeneous real data from astronomy. |
| Researcher Affiliation | Academia | Kyle Gilman, Department of Electrical Engineering and Computer Science, University of Michigan; David Hong, Department of Electrical and Computer Engineering, University of Delaware; Jeffrey A. Fessler, Department of Electrical Engineering and Computer Science, University of Michigan; Laura Balzano, Department of Electrical Engineering and Computer Science, University of Michigan |
| Pseudocode | Yes | Algorithm 1: SHASTA-PCA. Input: rank k, weights wₜ ∈ (0, 1], parameters c_F, c_v > 0, initialization parameter δ > 0. Data: [y₁, . . . , y_T], yₜ ∈ ℝᵈ, group memberships gₜ ∈ {1, 2, . . . , L} for all t, and sets of observed indices (Ω₁, . . . , Ω_T), where Ωₜ ⊆ {1, . . . , d}. Output: F ∈ ℝᵈˣᵏ, v ∈ ℝᴸ₊. |
| Open Source Code | Yes | Code for our project can be found at https://github.com/kgilman/Streaming-Heteroscedastic-PPCA. |
| Open Datasets | Yes | We illustrate SHASTA-PCA on real astronomy data from the Sloan Digital Sky Survey (SDSS) Data Release 16 (Ahumada et al., 2020) using the associated DR16Q quasar catalog (Lyke et al., 2020). |
| Dataset Splits | Yes | We then formed a training dataset with two groups: first, we collected samples starting from sample index 6,500 to the last index where the noise variance estimate is less than or equal to 1 (7,347); second, we collected training data beginning at the first index where the noise variance estimate is greater than or equal to 2 (8,839) up to the sample index 10,449, excluding the last 10 samples that are grossly corrupted. The resulting training dataset had n1 = 848 and n2 = 1,611 samples for the two groups, respectively... To study this case, we randomly obscure 60% of the entries uniformly at random and perform 10 passes over the data, randomizing the order of the samples each time. |
| Hardware Specification | Yes | All experiments were performed in Julia on a 2021 MacBook Pro with the Apple M1 Pro processor and 16 GB of memory. |
| Software Dependencies | No | All experiments were performed in Julia on a 2021 MacBook Pro with the Apple M1 Pro processor and 16 GB of memory. We reproduced and implemented all algorithms ourselves from their original source works. We used ChatGPT to convert the authors' original MATLAB code (https://github.com/thanhtbt/RST) to Julia, and we validated its outputs. No specific version numbers for Julia or any other libraries are mentioned. |
| Experiment Setup | Yes | For SHASTA-PCA, we use wₜ = 1/t (where t is the time index), c_F = c_v = 0.1, and initialize the parameters Rₜ(i) = δI with δ = 0.1 for both SHASTA-PCA and PETRELS. We initialize each streaming algorithm with the same random F₀, and each entry of v₀ for SHASTA-PCA uniformly at random between 0 and 1. ... Here, we empirically selected the constant parameters wₜ = 0.01, c_F = 0.01, and c_v = 0.1 for SHASTA-PCA that do not decay with time to adaptively track the dynamics of the subspace. After hyperparameter tuning, we set the step size of GROUSE to be 0.02 and set the forgetting factor for PETRELS to be λ = 0.998. |
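
The dataset split quoted in the table reduces to simple index arithmetic on the SDSS sample ordering. The sketch below reconstructs it under stated assumptions: the matrix `Y` is a synthetic stand-in for the real spectra (the true feature dimension is not quoted here), all variable names are hypothetical, and this is NumPy rather than the authors' Julia pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the SDSS spectra: d features x n samples.
d, n = 100, 12_000
Y = rng.standard_normal((d, n))

# Group 1: sample index 6,500 through the last index where the noise
# variance estimate is <= 1 (index 7,347), inclusive.
group1 = np.arange(6500, 7348)

# Group 2: first index where the noise variance estimate is >= 2 (8,839)
# up to index 10,449; per the quote, the 10 grossly corrupted trailing
# samples have already been excluded from this range.
group2 = np.arange(8839, 10450)

train_idx = np.concatenate([group1, group2])
Y_train = Y[:, train_idx]

# Matches the quoted group sizes n1 = 848 and n2 = 1,611.
assert group1.size == 848 and group2.size == 1611

# Missing-data experiment: obscure 60% of entries uniformly at random
# (mask entry True = observed, so ~40% of entries remain visible).
mask = rng.random(Y_train.shape) > 0.60
```

The 10 passes with reshuffling described in the quote would then just permute `train_idx` before each pass.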
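
The table quotes two weight regimes for SHASTA-PCA: a decaying schedule wₜ = 1/t for static-subspace convergence and a constant wₜ = 0.01 for tracking a time-varying subspace. A minimal sketch of that setup, with illustrative dimensions and hypothetical names (this is not the authors' implementation):

```python
import numpy as np

def weight(t, mode="decay"):
    """Step weight w_t: 1/t for convergence on static data,
    a small constant for adaptively tracking a moving subspace."""
    return 1.0 / t if mode == "decay" else 0.01

# Illustrative sizes; the paper's actual dimensions are not assumed here.
d, k, L = 50, 3, 2           # ambient dimension, rank, number of noise groups
delta = 0.1                  # R_t(i) = delta * I initialization (quoted value)
c_F, c_v = 0.1, 0.1          # damping parameters (quoted values)

rng = np.random.default_rng(0)
F0 = rng.standard_normal((d, k))       # shared random F0 across streaming methods
v0 = rng.uniform(0.0, 1.0, size=L)     # v0 entries uniform on (0, 1), per the quote
R = [delta * np.eye(k) for _ in range(d)]  # one k-by-k matrix per coordinate
```

With `mode="decay"` the weights satisfy the quoted wₜ = 1/t; switching to the constant mode mirrors the tracking experiments, where weights that do not shrink let the estimate follow subspace dynamics.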