Optimal Transfer Learning for Missing Not-at-Random Matrix Completion
Authors: Akhil Jalan, Yassir Jedra, Arya Mazumdar, Soumendu Sundar Mukherjee, Purnamrita Sarkar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets. ... Next, we compare our methods against existing algorithms on real-world and simulated datasets. ... We compare against two baselines from the matrix completion literature. ... We perform ablation studies to test the effect of other model parameters, such as rank, dimension, noise variance, etc. |
| Researcher Affiliation | Academia | 1Department of Computer Science, UT Austin, USA 2Laboratory for Information & Decision Systems (LIDS), MIT, USA 3Halıcıoğlu Data Science Institute & Department of Computer Science and Engineering, UC San Diego, USA 4Statistics and Mathematics Unit (SMU), Indian Statistical Institute, Kolkata, India 5Department of Statistics and Data Sciences, UT Austin, USA. Correspondence to: Akhil Jalan <EMAIL>. |
| Pseudocode | No | The paper describes the estimation framework and active sampling steps in prose and numbered lists within paragraphs (e.g., 'Least Squares Estimator. 1. Extract features via SVD... 2. Then solve... 3. Estimate Q:', and 'Active Sampling. Given U, V, and budget T_row, T_col, 1. Compute ϵ-approximate G-optimal designs... 2. Sample i_1, ..., i_{T_row}...'), but does not provide a distinct, structured pseudocode or algorithm block labeled as such. |
| Open Source Code | No | The paper states: 'See Appendix B for precise details of our implementations.' and 'We run all experiments on a Linux machine with 378GB of CPU/RAM. The total compute time across all results in the paper was less than 4 hours.' This indicates that code was implemented for the experiments, but there is no explicit statement that the code is open-sourced, nor a link to a code repository. |
| Open Datasets | Yes | We study real-world datasets on gene expression microarrays in a whole-blood sepsis study (Parnell et al., 2013), and weighted metabolic networks of gram-negative bacteria (King et al., 2016). ... For the gene expression experiments, we gather whole-blood sepsis gene expression data sampled by (Parnell et al., 2013), available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse54514. ... For the metabolic networks experiments, we access the BiGG genome-scale metabolic models datasets (King et al., 2016) at http://bigg.ucsd.edu. |
| Dataset Splits | No | The paper refers to varying "masking probabilities on Q̃" (e.g., p_Row = p_Col) to simulate missing data, and "random sampling from the G-optimal design" for active sampling. However, it does not provide explicit training, validation, or test dataset splits in terms of percentages, absolute counts, or specific predefined files, which are typically used for model evaluation and reproduction. |
| Hardware Specification | No | We run all experiments on a Linux machine with 378GB of CPU/RAM. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions 'See Appendix B for precise details of our implementations,' but these details do not include software versions. |
| Experiment Setup | Yes | We compute the active sampling estimator by fixing the budgets T_row = m·p_Row, T_col = n·p_Col throughout. ... Here, Q̃ has p_Row = p_Col varying along the x-axis, which displays p_Row². We set σ_Q = 0.1, and P is fully observed. ... We train not-MIWAE until convergence, with the latent dimension equal to the true matrix rank of Q, and a batch size of 32. ... The default settings are: Matrices P, Q ∈ ℝ^{m×n} with m=300, n=200. The parameters a=0.8, b=0.1 in the Partitioned Matrix Model. Additive noise for Q̃ is iid N(0, σ_Q²) with σ_Q = 0.1. The rank is d=5. p_Row = p_Col = 0.5, so the probability of seeing any entry of Q is 0.25. |
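The least-squares estimator quoted in the Pseudocode row (extract rank-d features from the fully observed source matrix P via SVD, fit a d×d core by least squares on the observed entries of Q̃, then estimate Q) can be sketched in NumPy under the paper's stated defaults (m=300, n=200, d=5, σ_Q=0.1, p_Row=p_Col=0.5). The synthetic data generation, the row/column-wise mask, and all variable names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Defaults quoted from the paper: m=300, n=200, rank d=5,
# sigma_Q=0.1, p_Row=p_Col=0.5 (so each entry of Q is seen w.p. 0.25).
m, n, d = 300, 200, 5
sigma_Q = 0.1
p_row = p_col = 0.5

# Hypothetical shared latent factors linking source P and target Q.
U0 = rng.normal(size=(m, d))
V0 = rng.normal(size=(n, d))
P = U0 @ rng.normal(size=(d, d)) @ V0.T   # fully observed source matrix
M_true = rng.normal(size=(d, d))
Q = U0 @ M_true @ V0.T                    # target matrix to recover

# Illustrative MNAR-style mask: rows and columns sampled independently.
rows = rng.random(m) < p_row
cols = rng.random(n) < p_col
mask = np.outer(rows, cols)
Q_tilde = np.where(mask, Q + sigma_Q * rng.normal(size=(m, n)), 0.0)

# Step 1: extract rank-d row/column features from P via SVD.
U_full, _, Vt = np.linalg.svd(P, full_matrices=False)
U, V = U_full[:, :d], Vt[:d, :].T

# Step 2: least squares for the d x d core over observed entries.
obs = np.argwhere(mask)
X = np.stack([np.outer(U[i], V[j]).ravel() for i, j in obs])
y = Q_tilde[mask]
M_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 3: estimate Q from the fitted core.
Q_hat = U @ M_hat.reshape(d, d) @ V.T
rel_err = np.linalg.norm(Q_hat - Q) / np.linalg.norm(Q)
```

Because the synthetic target here shares its row and column spans with P, the relative error is driven only by the additive noise; on real data the quality of the transferred features would matter as well.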