Matrix Completion with Covariate Information and Informative Missingness
Authors: Huaqing Jin, Yanyuan Ma, Fei Jiang
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The method is demonstrated via simulation studies and is used to analyze a Yelp data set and a Movie Lens data set. |
| Researcher Affiliation | Academia | Huaqing Jin EMAIL Department of Statistics and Actuarial Science The University of Hong Kong, Hong Kong; Yanyuan Ma EMAIL Department of Statistics Pennsylvania State University; Fei Jiang EMAIL Department of Epidemiology and Biostatistics The University of California, San Francisco |
| Pseudocode | No | The paper describes the computational algorithm in Section 4 (Computational Algorithm and Convergence), outlining the steps for updating β and Θ, but it does so in narrative text and mathematical expressions rather than a structured pseudocode block or algorithm environment. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide any links to a code repository. |
| Open Datasets | Yes | The method is demonstrated via simulation studies and is used to analyze a Yelp data set and a Movie Lens data set. The data set is available at https://www.yelp.com/dataset/documentation/main. We compare the performance of our method with that of the WCF method proposed in Hu et al. (2008) on Movie Lens 1M data set, which includes one million ratings from 6040 users and 3952 movies. Movie Lens (https://grouplens.org/datasets/movielens/) |
| Dataset Splits | No | The paper describes a process for introducing additional missingness into the data for evaluation purposes (e.g., 'We first remove the observed Yij's with probabilities p1 and p0 for Yij = 1 and Yij = 0, respectively, where p0 and p1 with p0 = p1 are chosen so that an additional α100% missingness is introduced into the data.') to simulate an MNAR scenario, but it does not provide explicit train/test/validation splits for the overall datasets used to reproduce model training. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We design Θ0 to be a rank-5 matrix with singular values (10, 1.8, 1.6, 1.4, 1.2), and β0 = (1, 0, 2, 0, 3, 4, 5, 0, . . . , 0)^T. We set m = n, and vary m, n from 100 to 1600. Furthermore, we generate Rij from Pr(Rij = 1 \| Yij) = expit(Yij − Ȳ − D), where Ȳ = Σ_{i,j} Yij/(mn) and D is chosen to achieve 90% missingness in the data. As specified in Theorems 1 and 2, we select λβ = Cβ log{max(p, mn)}/(mn) and λΘ = CΘ max[√{log(d)/d}, {log max(p, mn)}^{1/4}/d], where Cβ and CΘ are constants chosen to achieve similar sparseness of the estimators across all situations. In the implementation, we use the Monte Carlo method to approximate the integration in the loss function, while the distribution of Xij is estimated empirically. At the tth iteration, we sample M copies of βt^T Xij from the empirical distribution, and approximate the integration. |
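
The simulation design quoted in the Experiment Setup row can be sketched as follows. This is a hypothetical illustration, not the authors' code: the outcome model (linear signal plus Gaussian noise), the covariate dimension `p = 20`, and the bisection search for the missingness constant `D` are all assumptions filled in for demonstration; only the rank-5 Θ0 with the stated singular values, the sparse β0, and the expit missingness mechanism targeting 90% missingness come from the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Smallest stated matrix size; the paper varies m = n from 100 to 1600.
m = n = 100

# Rank-5 Theta0 with singular values (10, 1.8, 1.6, 1.4, 1.2),
# built from random orthonormal factors.
U, _ = np.linalg.qr(rng.standard_normal((m, 5)))
V, _ = np.linalg.qr(rng.standard_normal((n, 5)))
Theta0 = U @ np.diag([10, 1.8, 1.6, 1.4, 1.2]) @ V.T

# Sparse beta0 = (1, 0, 2, 0, 3, 4, 5, 0, ..., 0)^T; p = 20 is assumed.
p = 20
beta0 = np.zeros(p)
beta0[[0, 2, 4, 5, 6]] = [1, 2, 3, 4, 5]

# Hypothetical outcome model: covariate signal plus low-rank mean plus noise.
X = rng.standard_normal((m, n, p))
Y = X @ beta0 + Theta0 + rng.standard_normal((m, n))

# MNAR observation indicators: Pr(R_ij = 1 | Y_ij) = expit(Y_ij - Ybar - D),
# with D tuned so that about 90% of entries are missing.
Ybar = Y.mean()

def miss_rate(D):
    """Expected missingness for a given offset D (increasing in D)."""
    return 1.0 - expit(Y - Ybar - D).mean()

# Crude bisection for D (illustration only).
lo, hi = -20.0, 20.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if miss_rate(mid) < 0.9:
        lo = mid
    else:
        hi = mid
D = 0.5 * (lo + hi)

R = rng.binomial(1, expit(Y - Ybar - D))
print(f"achieved missingness: {1 - R.mean():.2f}")
```

The bisection step stands in for whatever root-finding the authors used; any monotone solver for `miss_rate(D) = 0.9` would serve the same purpose.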