Optimal Transfer Learning for Missing Not-at-Random Matrix Completion
Authors: Akhil Jalan, Yassir Jedra, Arya Mazumdar, Soumendu Sundar Mukherjee, Purnamrita Sarkar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets. ... Next, we compare our methods against existing algorithms on real-world and simulated datasets. ... We compare against two baselines from the matrix completion literature. ... We perform ablation studies to test the effect of other model parameters, such as rank, dimension, noise variance, etc. |
| Researcher Affiliation | Academia | 1Department of Computer Science, UT Austin, USA 2Laboratory for Information & Decision Systems (LIDS), MIT, USA 3Halıcıoğlu Data Science Institute & Department of Computer Science and Engineering, UC San Diego, USA 4Statistics and Mathematics Unit (SMU), Indian Statistical Institute, Kolkata, India 5Department of Statistics and Data Sciences, UT Austin, USA. Correspondence to: Akhil Jalan <EMAIL>. |
| Pseudocode | No | The paper describes the estimation framework and active sampling steps in prose and numbered lists within paragraphs (e.g., 'Least Squares Estimator. 1. Extract features via SVD... 2. Then solve... 3. Estimate Q:', and 'Active Sampling. Given U, V, and budget T_row, T_col, 1. Compute ϵ-approximate G-optimal designs... 2. Sample i_1, ..., i_{T_row}...'), but does not provide a distinct, structured pseudocode or algorithm block labeled as such. |
| Open Source Code | No | The paper states: 'See Appendix B for precise details of our implementations.' and 'We run all experiments on a Linux machine with 378GB of CPU/RAM. The total compute time across all results in the paper was less than 4 hours.' This indicates that code was implemented for the experiments, but there is no explicit statement that the code is open-sourced, nor a link to a code repository. |
| Open Datasets | Yes | We study real-world datasets on gene expression microarrays in a whole-blood sepsis study (Parnell et al., 2013), and weighted metabolic networks of gram-negative bacteria (King et al., 2016). ... For the gene expression experiments, we gather whole-blood sepsis gene expression data sampled by (Parnell et al., 2013), available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse54514. ... For the metabolic networks experiments, we access the BiGG genome-scale metabolic models datasets (King et al., 2016) at http://bigg.ucsd.edu. |
| Dataset Splits | No | The paper refers to varying "masking probabilities on Q̃" (e.g., p_Row = p_Col) to simulate missing data, and "random sampling from the G-optimal design" for active sampling. However, it does not provide explicit training, validation, or test dataset splits in terms of percentages, absolute counts, or specific predefined files, which are typically used for model evaluation and reproduction. |
| Hardware Specification | No | We run all experiments on a Linux machine with 378GB of CPU/RAM. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions 'See Appendix B for precise details of our implementations,' but these details do not include software versions. |
| Experiment Setup | Yes | We compute the active sampling estimator by fixing the budgets T_row = m·p_Row, T_col = n·p_Col throughout. ... Here, Q̃ has p_Row = p_Col varying along the x-axis, which displays p_Row². We set σ_Q = 0.1, and P is fully observed. ... We train not-MIWAE until convergence, with the latent dimension equal to the true matrix rank of Q, and a batch size of 32. ... The default settings are: Matrices P, Q ∈ ℝ^{m×n} with m=300, n=200. The parameters a=0.8, b=0.1 in the Partitioned Matrix Model. Additive noise for Q̃ is iid N(0, σ_Q²) with σ_Q = 0.1. The rank is d=5. p_Row = p_Col = 0.5, so the probability of seeing any entry of Q is 0.25. |
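The least-squares estimator quoted in the Pseudocode row (extract rank-d features from the fully observed source matrix P via SVD, fit a d×d core by least squares on the observed entries of Q̃, then estimate Q) can be sketched in NumPy under the paper's stated defaults (m=300, n=200, d=5, σ_Q=0.1, p_Row=p_Col=0.5). The synthetic data generation, the row/column-wise mask, and all variable names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Defaults quoted from the paper: m=300, n=200, rank d=5,
# sigma_Q=0.1, p_Row=p_Col=0.5 (so each entry of Q is seen w.p. 0.25).
m, n, d = 300, 200, 5
sigma_Q = 0.1
p_row = p_col = 0.5

# Hypothetical shared latent factors linking source P and target Q.
U0 = rng.normal(size=(m, d))
V0 = rng.normal(size=(n, d))
P = U0 @ rng.normal(size=(d, d)) @ V0.T   # fully observed source matrix
M_true = rng.normal(size=(d, d))
Q = U0 @ M_true @ V0.T                    # target matrix to recover

# Illustrative MNAR-style mask: rows and columns sampled independently.
rows = rng.random(m) < p_row
cols = rng.random(n) < p_col
mask = np.outer(rows, cols)
Q_tilde = np.where(mask, Q + sigma_Q * rng.normal(size=(m, n)), 0.0)

# Step 1: extract rank-d row/column features from P via SVD.
U_full, _, Vt = np.linalg.svd(P, full_matrices=False)
U, V = U_full[:, :d], Vt[:d, :].T

# Step 2: least squares for the d x d core over observed entries.
obs = np.argwhere(mask)
X = np.stack([np.outer(U[i], V[j]).ravel() for i, j in obs])
y = Q_tilde[mask]
M_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 3: estimate Q from the fitted core.
Q_hat = U @ M_hat.reshape(d, d) @ V.T
rel_err = np.linalg.norm(Q_hat - Q) / np.linalg.norm(Q)
```

Because the synthetic target here shares its row and column spans with P, the relative error is driven only by the additive noise; on real data the quality of the transferred features would matter as well.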