Population Priors for Matrix Factorization
Authors: Sohrab Salehi, Achille Nazaret, Sohrab P Shah, David Blei
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this approach with both synthetic and real-world data on diverse applications: movie ratings, book ratings, single-cell gene expression data, and musical preferences. Without needing to tune Bayesian hyperparameters, we find that the twin population prior leads to high-quality predictions, outperforming manually tuned priors. |
| Researcher Affiliation | Academia | Sohrab Salehi EMAIL Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA Irving Institute for Cancer Dynamics, Columbia University, New York, NY, USA Achille Nazaret EMAIL Department of Computer Science, Columbia University, New York, NY, USA Sohrab P Shah EMAIL Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA David M Blei EMAIL Department of Statistics, Columbia University, New York, NY, USA Data Science Institute, Columbia University, New York, NY, USA Department of Computer Science, Columbia University, New York, NY, USA |
| Pseudocode | Yes | Algorithm 1 Variational inference for Poisson matrix factorization with twin EB priors |
| Open Source Code | Yes | Our implementation is available at https://github.com/blei-lab/TwinEB. |
| Open Datasets | Yes | MovieLens 1M. This dataset comprises 1 million ratings from 6,000 users (rows) on 4,000 movies (columns) (Harper & Konstan, 2015). Ru1322b. We analyze single-cell gene expression data from a patient with small cell lung cancer (Chan et al., 2021). User Artists. We use the data introduced in Cantador et al. (2011), comprising 92,834 user-listened artist relations across 1,892 users and 17,632 artists, with a maximum value of 352,698. Good Books. Contains 6 million ratings across 51,288 users and 10,000 books (accessed at https://github.com/zygmuntz/goodbooks-10k). Ratings range from 1 to 5. The sparsity of this dataset is 0.01. |
| Dataset Splits | Yes | We randomly assign 20% of the rows as the test set and the rest as training data. We then mask 20% of the entries at random and train the model on this train set using ten random restarts. We use the 20% masked entries as a validation set. At test time, we put aside 30% of the entries of the test rows at random; these entries constitute the test set. We then train the model on 40% of the rest of the entries. This procedure measures strong generalization (Steck, 2019). |
| Hardware Specification | Yes | We ran our experiment on a machine equipped with an NVIDIA A100 GPU with 80GB memory. |
| Software Dependencies | No | We implemented all methods in pytorch (Paszke et al., 2019). |
| Experiment Setup | Yes | We set a batch size of 128 in all our experiments. We ran Poisson and Gaussian matrix factorization experiments for a maximum of 20,000 iterations; by this step, all runs had converged. We initialized the learning rates for the row and column variables, rlr and clr, separately, fixing the initial values at rlr ∈ {0.01} and clr ∈ {0.01}. In the experiments in the main text, we use 10 Monte Carlo samples to approximate the ELBO, while in the supplemental experiments, we use a single particle. |
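The strong-generalization split quoted in the Dataset Splits row can be sketched as follows. All names here are hypothetical and the code is only an illustration, not the authors' implementation (which lives at https://github.com/blei-lab/TwinEB); in particular, we read "40% of the rest of the entries" as 40% of the non-test entries within each test row, which is an assumption.

```python
import numpy as np

def strong_generalization_split(x, seed=0):
    """Hypothetical sketch of the strong-generalization protocol (Steck, 2019)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = x.shape

    # 20% of rows become test rows; the rest are training rows.
    perm = rng.permutation(n_rows)
    n_test = int(0.2 * n_rows)
    test_rows, train_rows = perm[:n_test], perm[n_test:]

    # Mask 20% of the training-row entries as a validation set.
    val_mask = rng.random((train_rows.size, n_cols)) < 0.2

    # In each test row: 30% of entries are held out as test entries,
    # and the model is fit on 40% of the remaining entries (assumption).
    test_mask = rng.random((test_rows.size, n_cols)) < 0.3
    fit_mask = ~test_mask & (rng.random((test_rows.size, n_cols)) < 0.4)

    return train_rows, val_mask, test_rows, test_mask, fit_mask
```

The key property the masks must satisfy is that no entry is ever both a test entry and a fitting entry, which the `~test_mask &` guard enforces.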
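The Experiment Setup row mentions approximating the ELBO with 10 Monte Carlo samples. The sketch below shows what such an estimator looks like for a deliberately simplified toy model: a single Poisson observation with a log-normal prior and a log-normal variational posterior, sampled via reparameterization. This collapses the paper's factorized rate (a row-column inner product) into one scalar, so it illustrates only the Monte Carlo estimator itself, not the paper's model or priors.

```python
import numpy as np
from math import lgamma

def mc_elbo(x, mu, sigma, n_samples=10, seed=0):
    """Monte Carlo ELBO for a toy model (hypothetical helper):
    z ~ LogNormal(0, 1), x ~ Poisson(z), q(z) = LogNormal(mu, sigma)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    log_z = mu + sigma * eps        # reparameterized samples of log z
    z = np.exp(log_z)

    # log p(x | z): Poisson log-likelihood.
    log_lik = x * log_z - z - lgamma(x + 1.0)

    # log p(z): LogNormal(0, 1) prior density.
    log_prior = -log_z - 0.5 * np.log(2 * np.pi) - 0.5 * log_z**2

    # log q(z): LogNormal(mu, sigma) variational density.
    log_q = (-log_z - np.log(sigma) - 0.5 * np.log(2 * np.pi)
             - 0.5 * ((log_z - mu) / sigma) ** 2)

    # Average of log p(x, z) - log q(z) over the Monte Carlo samples.
    return np.mean(log_lik + log_prior - log_q)
```

With `n_samples=10` this mirrors the main-text setting; the supplemental single-particle setting corresponds to `n_samples=1`, which is unbiased but higher variance.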