Streaming Heteroscedastic Probabilistic PCA with Missing Data
Authors: Kyle Gilman, David Hong, Jeffrey A. Fessler, Laura Balzano
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments demonstrate the superior subspace estimation of our method compared to state-of-the-art streaming PCA algorithms in the heteroscedastic setting. Finally, we illustrate SHASTA-PCA applied to highly heterogeneous real data from astronomy. |
| Researcher Affiliation | Academia | Kyle Gilman, Department of Electrical Engineering and Computer Science, University of Michigan; David Hong, Department of Electrical and Computer Engineering, University of Delaware; Jeffrey A. Fessler, Department of Electrical Engineering and Computer Science, University of Michigan; Laura Balzano, Department of Electrical Engineering and Computer Science, University of Michigan |
| Pseudocode | Yes | Algorithm 1: SHASTA-PCA. Input: rank k, weights wₜ ∈ (0, 1], parameters c_F, c_v > 0, initialization parameter δ > 0. Data: [y₁, . . . , y_T], yₜ ∈ ℝᵈ, group memberships gₜ ∈ {1, 2, . . . , L} for all t, and sets of observed indices (Ω₁, . . . , Ω_T), where Ωₜ ⊆ {1, . . . , d}. Output: F ∈ ℝᵈˣᵏ, v ∈ ℝᴸ₊. |
| Open Source Code | Yes | Code for our project can be found at https://github.com/kgilman/Streaming-Heteroscedastic-PPCA. |
| Open Datasets | Yes | We illustrate SHASTA-PCA on real astronomy data from the Sloan Digital Sky Survey (SDSS) Data Release 16 (Ahumada et al., 2020) using the associated DR16Q quasar catalog (Lyke et al., 2020). |
| Dataset Splits | Yes | We then formed a training dataset with two groups: first, we collected samples starting from sample index 6,500 to the last index where the noise variance estimate is less than or equal to 1 (7,347); second, we collected training data beginning at the first index where the noise variance estimate is greater than or equal to 2 (8,839) up to the sample index 10,449, excluding the last 10 samples that are grossly corrupted. The resulting training dataset had n1 = 848 and n2 = 1,611 samples for the two groups, respectively... To study this case, we randomly obscure 60% of the entries uniformly at random and perform 10 passes over the data, randomizing the order of the samples each time. |
| Hardware Specification | Yes | All experiments were performed in Julia on a 2021 MacBook Pro with the Apple M1 Pro processor and 16 GB of memory. |
| Software Dependencies | No | All experiments were performed in Julia on a 2021 MacBook Pro with the Apple M1 Pro processor and 16 GB of memory. We reproduced and implemented all algorithms ourselves from their original source works. We used ChatGPT to convert the authors' original MATLAB code (https://github.com/thanhtbt/RST) to Julia, and we validated its outputs. No specific version numbers for Julia or any other libraries are mentioned. |
| Experiment Setup | Yes | For SHASTA-PCA, we use wₜ = 1/t (where t is the time index), c_F = c_v = 0.1, and initialize the parameters Rₜ(i) = δI with δ = 0.1 for both SHASTA-PCA and PETRELS. We initialize each streaming algorithm with the same random F₀, and each entry of v₀ for SHASTA-PCA uniformly at random between 0 and 1. ... Here, we empirically selected the constant parameters wₜ = 0.01, c_F = 0.01, and c_v = 0.1 for SHASTA-PCA that do not decay with time to adaptively track the dynamics of the subspace. After hyperparameter tuning, we set the step size of GROUSE to be 0.02 and set the forgetting factor for PETRELS to be λ = 0.998. |
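
The dataset split quoted in the table reduces to simple index arithmetic on the SDSS sample ordering. The sketch below reconstructs it under stated assumptions: the matrix `Y` is a synthetic stand-in for the real spectra (the true feature dimension is not quoted here), all variable names are hypothetical, and this is NumPy rather than the authors' Julia pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the SDSS spectra: d features x n samples.
d, n = 100, 12_000
Y = rng.standard_normal((d, n))

# Group 1: sample index 6,500 through the last index where the noise
# variance estimate is <= 1 (index 7,347), inclusive.
group1 = np.arange(6500, 7348)

# Group 2: first index where the noise variance estimate is >= 2 (8,839)
# up to index 10,449; per the quote, the 10 grossly corrupted trailing
# samples have already been excluded from this range.
group2 = np.arange(8839, 10450)

train_idx = np.concatenate([group1, group2])
Y_train = Y[:, train_idx]

# Matches the quoted group sizes n1 = 848 and n2 = 1,611.
assert group1.size == 848 and group2.size == 1611

# Missing-data experiment: obscure 60% of entries uniformly at random
# (mask entry True = observed, so ~40% of entries remain visible).
mask = rng.random(Y_train.shape) > 0.60
```

The 10 passes with reshuffling described in the quote would then just permute `train_idx` before each pass.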
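
The table quotes two weight regimes for SHASTA-PCA: a decaying schedule wₜ = 1/t for static-subspace convergence and a constant wₜ = 0.01 for tracking a time-varying subspace. A minimal sketch of that setup, with illustrative dimensions and hypothetical names (this is not the authors' implementation):

```python
import numpy as np

def weight(t, mode="decay"):
    """Step weight w_t: 1/t for convergence on static data,
    a small constant for adaptively tracking a moving subspace."""
    return 1.0 / t if mode == "decay" else 0.01

# Illustrative sizes; the paper's actual dimensions are not assumed here.
d, k, L = 50, 3, 2           # ambient dimension, rank, number of noise groups
delta = 0.1                  # R_t(i) = delta * I initialization (quoted value)
c_F, c_v = 0.1, 0.1          # damping parameters (quoted values)

rng = np.random.default_rng(0)
F0 = rng.standard_normal((d, k))       # shared random F0 across streaming methods
v0 = rng.uniform(0.0, 1.0, size=L)     # v0 entries uniform on (0, 1), per the quote
R = [delta * np.eye(k) for _ in range(d)]  # one k-by-k matrix per coordinate
```

With `mode="decay"` the weights satisfy the quoted wₜ = 1/t; switching to the constant mode mirrors the tracking experiments, where weights that do not shrink let the estimate follow subspace dynamics.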