reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

General Latent Feature Models for Heterogeneous Datasets

Authors: Isabel Valera, Melanie F. Pradier, Maria Lomeli, Zoubin Ghahramani

JMLR 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show the flexibility of the proposed model by solving both prediction and data analysis tasks on several real-world datasets. [...] In this section, we apply the proposed model to solve two different tasks on several real-world datasets. In Section 5.1, we focus on a prediction task in which we aim to estimate and replace the missing data, which is assumed to be missing completely at random. [...] In Section 5.2, we focus on a data analysis task on several real-world datasets from different application domains such as medicine, psychiatry, clinical trials and politics.
Researcher Affiliation	Collaboration	Isabel Valera EMAIL Department of Computer Science, Saarland University, Saarbrücken, Germany; and Max Planck Institute for Intelligent Systems, Tübingen, Germany; Melanie F. Pradier EMAIL School of Engineering and Applied Sciences, Harvard University, Cambridge, USA; Maria Lomeli EMAIL Department of Engineering, University of Cambridge, Cambridge, UK; Zoubin Ghahramani EMAIL Department of Engineering, University of Cambridge, Cambridge, UK; and Uber AI, San Francisco, California, USA
Pseudocode	Yes	Algorithm 1 Inference Algorithm. Input: X. Initialize: Z and {Yd} for d = 1, ..., D. 1: for each iteration do 2: Update Z given {Yd} for d = 1, ..., D as detailed in Section 4.1. 3: for d = 1, ..., D do 4: Sample Bd given Z and Yd according to (9). 5: Sample Yd given X, Z and Bd as shown in Section 4.2. 6: Sample Ψd (if needed) as shown in Section 4.2. 7: end for 8: end for. Output: Z, {Bd} and {Ψd} for d = 1, ..., D.
Open Source Code	Yes	Finally, a software package, called GLFM toolbox, is made publicly available for other researchers to use and extend. It is available at https://ivaleram.github.io/GLFM/. [...] The source software package is publicly available at https://github.com/ivaleraM/GLFM, that provides users with the necessary functions and scripts to use the GLFM for both missing data estimation and data exploration tasks.
Open Datasets	Yes	We evaluate the predictive power of the proposed model at estimating missing data on five real datasets, which are summarized in Table 1. The datasets contain different numbers of objects and attributes, which cover all the discrete and continuous variables described in Section 3. ... Statlog German credit dataset (Eggermont et al., 2004) ... QSAR biodegradation dataset (Mansouri et al., 2013) ... Internet usage survey dataset (Centre, 2014) ... Wine quality dataset (Cortez et al., 2009) ... Nesarc dataset (Ruiz et al., 2013)
Dataset Splits	Yes	Each value in Figure 2 was obtained by averaging the results across 20 independently split sets where the missing values were randomly chosen.
Hardware Specification	No	No specific hardware details (like GPU/CPU models or memory) were mentioned for running experiments. The paper focuses on the software implementation and theoretical aspects of the model, along with experimental results on various datasets.
Software Dependencies	No	The core inference algorithm is developed in C++, and the corresponding user interfaces are provided in Matlab, Python and R. [...] Finally, our implementation of the GLFM makes use of the GNU Scientific Library (GSL). No specific version numbers were provided for C++, Matlab, Python, R, or GSL.
Experiment Setup	Yes	In the GLFM model, for real positive and/or count data, we consider the following transformation that maps from the real numbers to the real positive numbers, f(x) = log(exp(wx) + 1). We select the parameter w such that the data is scaled to a common range. For each dataset we run 5,000 iterations of the proposed MCMC sampler from Section 4. ... In our experiments, we sample the variance of the pseudo-observations in each dimension and choose the parameter values as follows: α = 5, σ_B^2 = 1, and σ_θ^2 = 1. We also consider the following transformation that maps from the real numbers to the positive real numbers, for the positive real and count data: f(x) = log(w (x − µ) + 1), where µ = min(xd) and w = 2/std(xd) are data-driven parameters whose objective is to shift and scale the data. In order to obtain more interpretable results, we also activated the bias term, as explained in Section 5.2.1.
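
The two transformations quoted in the Experiment Setup row can be sketched in plain Python. This is an illustrative sketch only, not code from the GLFM toolbox. The function names are hypothetical, the minus sign in w (x − µ) is read from the description of µ as a shift, and the use of the sample standard deviation is an assumption, since the excerpt does not say which estimator is meant.

```python
import math
import statistics


def softplus_transform(x, w=1.0):
    """f(x) = log(exp(w*x) + 1): maps any real x to a positive real."""
    z = w * x
    # Numerically stable identity: log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def data_driven_params(column):
    """Return (mu, w) with mu = min(x_d) and w = 2 / std(x_d)."""
    mu = min(column)
    # Assumption: sample standard deviation (n - 1 denominator).
    w = 2.0 / statistics.stdev(column)
    return mu, w


def shifted_log_transform(x, mu, w):
    """f(x) = log(w * (x - mu) + 1), defined for x >= mu."""
    return math.log(w * (x - mu) + 1.0)


# Hypothetical attribute column, used only to exercise the functions.
column = [3.0, 5.0, 7.0, 11.0]
mu, w = data_driven_params(column)
transformed = [shifted_log_transform(x, mu, w) for x in column]
```

With these definitions the smallest observed value maps to 0, because w (x − µ) + 1 equals 1 at x = µ, and larger values grow logarithmically.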