Bootstrap-Based Regularization for Low-Rank Matrix Estimation
Authors: Julie Josse, Stefan Wager
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To assess our proposed methods, we first run comparative simulation studies for different noise models. We begin with a sanity check: in Section 5.1, we reproduce the isotropic Gaussian noise experiments of Candès et al. (2013), and find that our method is competitive with existing approaches on this standard benchmark. We then move to the non-isotropic case, where we can take advantage of our method's ability to adapt to different noise structures. In Section 5.2 we show results on experiments with Poisson noise, and find that our method substantially outperforms its competitors. Finally, in Section 6, we apply our method to real-world applications motivated by topic modeling and sensory analysis. |
| Researcher Affiliation | Academia | Julie Josse (EMAIL), Department of Applied Mathematics, Agrocampus Ouest, Rennes, France; INRIA Saclay, Université Paris-Sud, Orsay, France. Stefan Wager (EMAIL), Department of Statistics, Stanford University, Stanford, U.S.A. |
| Pseudocode | Yes | Algorithm 1 (low-rank matrix estimation via iterated stable autoencoding): initialize μ̂ ← X; compute S_jj ← Σ_{i=1}^{n} Var_{X̃ ∼ L_δ(X)}[X̃_ij] for all j = 1, ..., p; while the algorithm has not converged do: B̂ ← (μ̂ᵀμ̂ + S)⁻¹ μ̂ᵀμ̂; μ̂ ← X B̂; end while. |
| Open Source Code | Yes | A software implementation of the proposed methods is available through the R package denoiseR (Josse et al., 2016). |
| Open Datasets | Yes | To do so, we examine the Rotten Tomatoes movie review dataset collected by Pang and Lee (2004), with n = 2,000 documents and p = 50,921 unique words. The data for the analysis was collected by asking consumers to describe 12 luxury perfumes such as Chanel Number 5 and J'adore with words. The answers were then organized in a 12 × 39 data matrix (39 unique words were used), where each cell represents the number of times a word is associated with a perfume; a total of N = 1075 words were used overall. The dataset is available at http://factominer.free.fr/docs/perfume.txt. |
| Dataset Splits | Yes | We trained the logistic regression on one half of the data and then tested it on the other half, repeating this process over 10,000 random splits. We used the full N = 1075 perfume dataset as the population dataset, and then generated samples of size N = 200 by subsampling the original dataset without replacement. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, etc.) are mentioned in the paper. |
| Software Dependencies | No | The paper mentions the R package denoiseR but does not provide specific version numbers for R or the package itself. |
| Experiment Setup | Yes | As discussed in Section 2, we applied our stable autoencoding methods to Xᵀ rather than X, so that n was larger than p, and set the tuning parameter to δ = 1/2. For ISA, we ran the iterative Algorithm 1 for 100 steps, although the algorithm appeared to become stable after 10 steps already. We used both SA and ISA to estimate µ from X; in both cases, we generated X̃ with the Poisson-compatible bootstrap noise model (8), and set δ = 1/2. For LN, SA and ISA, we set tuning parameters as in Section 5.2, namely LN uses σ̂ from (31), while SA is performed with δ = 0.5. For ISA we used δ = 0.3; this latter choice was made to obtain rank-2 estimates. |
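
The iterated stable autoencoding loop quoted in the Pseudocode row above can be sketched as follows. This is a minimal NumPy illustration of the algorithmic structure only, not the denoiseR implementation: the column noise variances `S` (which in the paper are S_jj = Σ_i Var[X̃_ij] under the chosen noise model L_δ) are assumed to be supplied by the caller, and the fixed iteration count stands in for the paper's convergence check.

```python
import numpy as np

def isa(X, S, n_iter=100):
    """Sketch of iterated stable autoencoding (Algorithm 1).

    X : (n, p) data matrix.
    S : (p,) vector of column noise variances, model-specific
        (e.g. derived from a Gaussian or Poisson noise model L_delta).
    """
    mu = X.copy()
    for _ in range(n_iter):
        G = mu.T @ mu                           # p x p Gram matrix mu'mu
        # B = (mu'mu + S)^{-1} mu'mu, the shrinkage autoencoder matrix
        B = np.linalg.solve(G + np.diag(S), G)
        mu = X @ B                              # updated low-rank estimate
    return mu
```

With `S = 0` the autoencoder matrix reduces to the identity (no shrinkage), while larger entries of `S` shrink the estimate more aggressively toward a low-rank fit, which mirrors the role of δ in the paper.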