reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Regularized Joint Mixture Models

Authors: Konstantinos Perrakis, Thomas Lartigue, Frank Dondelinger, Sach Mukherjee

JMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We illustrate the key ideas via empirical examples. An R package is available at https://github.com/k-perrakis/regjmix. ... In Section 5 we present empirical examples, focusing initially on small-scale simulations and then proceeding to larger scale semi-synthetic experiments and applications to real data.
Researcher Affiliation	Academia	Konstantinos Perrakis EMAIL Department of Mathematical Sciences Durham University, UK; Thomas Lartigue EMAIL Aramis Project Team, Inria & Center of Applied Mathematics, CNRS, Ecole Polytechnique, IP Paris, France; Frank Dondelinger EMAIL Lancaster Medical School Lancaster UK; Sach Mukherjee EMAIL German Center for Neurodegenerative Diseases, Bonn, Germany & MRC Biostatistics Unit, University of Cambridge, UK
Pseudocode	No	The paper describes the expectation and maximization steps of the EM algorithm in Section 3.1 with mathematical equations (e.g., 'The E-Step.' and 'The M-Step.'). However, it does not present a clearly labeled 'Pseudocode' or 'Algorithm' block with structured, step-by-step instructions. The descriptions are provided in paragraph form supported by equations.
Open Source Code	Yes	An R package is available at https://github.com/k-perrakis/regjmix. ... The RJM methods presented in this paper are implemented as an R package regjmix, available at https://github.com/k-perrakis/regjmix.
Open Datasets	Yes	The simulations presented below are based on data from The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov). ... We use the TCGA data as introduced above, with gene expression levels treated as features.
Dataset Splits	Yes	For all simulations we use $n = 250$, balanced group sample sizes, i.e. $n_k = 125$ for $k = 1, 2$ ... We use 80% of the samples for training and 20% for testing.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments. It mentions various R packages for software, but no hardware.
Software Dependencies	No	The paper mentions several R packages used (e.g., 'R package glassoFast', 'glmnet', 'mclust', 'MoEClust', 'flexmix', 'cluster', 'SwarmSVM') and cites their respective papers. However, it does not explicitly state specific version numbers for these software components or the R environment itself, as required by the criteria.
Experiment Setup	Yes	For all simulations we use $n = 250$, balanced group sample sizes, i.e. $n_k = 125$ for $k = 1, 2$, and varying dimensionality for the features; namely, i) $p = 100$ ($n > p$ problem), ii) $p = 250$ ($n = p$ problem) and iii) $p = 500$ ($n < p$ problem). ... As a default option we use ten EM starts. For the termination of the algorithm we use a combination of two criteria that are commonly used in practice. The first is to simply set a maximum number ($T$) of EM iterations. Empirical results suggest that the option $T = 20$ is sufficient. The second criterion takes into account the relative change in the objective function in (15); namely, the algorithm is stopped when $\left| \frac{Q(\theta, \tau, \lambda \mid \theta^{(t)}, \tau^{(t)}, \lambda^{(t)})}{Q(\theta, \tau, \lambda \mid \theta^{(t-1)}, \tau^{(t-1)}, \lambda^{(t-1)})} - 1 \right| < \epsilon$, using as default option $\epsilon = 10^{-6}$.
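
The two termination criteria quoted in the Experiment Setup row (an iteration cap T = 20 and a relative-change tolerance of 10^-6 on the objective) can be illustrated with a minimal Python sketch. This is not the regjmix implementation: the function names `em_should_stop` and `run_em`, and the `step` callable that stands in for one EM iteration and returns the objective value, are hypothetical and introduced here for illustration only. The sketch assumes the previous objective value is nonzero.

```python
def em_should_stop(q_curr, q_prev, iteration, max_iter=20, eps=1e-6):
    """Return True when either quoted termination criterion is met.

    q_curr, q_prev: objective values at iterations t and t-1
                    (q_prev is assumed to be nonzero).
    iteration:      current iteration count t (1-based).
    """
    # Criterion 1: maximum number T of EM iterations reached.
    if iteration >= max_iter:
        return True
    # Criterion 2: relative change |Q_t / Q_{t-1} - 1| below eps.
    return abs(q_curr / q_prev - 1.0) < eps


def run_em(step, q_init, max_iter=20, eps=1e-6):
    """Run a hypothetical EM loop until one of the criteria fires.

    step(t) is a placeholder for one EM iteration; it must return the
    objective value after iteration t. Returns (iterations, objective).
    """
    q_prev = q_init
    t = 0
    while True:
        t += 1
        q_curr = step(t)
        if em_should_stop(q_curr, q_prev, t, max_iter, eps):
            return t, q_curr
        q_prev = q_curr


if __name__ == "__main__":
    # Toy objective sequence that converges geometrically (illustrative only).
    iterations, objective = run_em(lambda t: -100.0 * (1.0 + 0.1 ** t), -200.0)
    print(iterations, objective)
```

Checking the iteration cap first means the loop always ends after at most T iterations, even if the objective never stabilises.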