reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering

Authors: Israel A. Almodóvar-Rivera, Ranjan Maitra

JMLR 2020 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets and can be applied to datasets with scatter or incomplete records. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study. Keywords: BATSE, DEMP, DEMP+, DBSCAN*, density peaks algorithm, GRB, GSLNN, k-clips, k-means, km-means, kernel density estimation, KNOB-SynC, MixModCombi, MGHD, MSAL, overlap, PGMM, SDSS, spectral clustering, TiK-means
Researcher Affiliation	Academia	Israel A. Almodovar-Rivera EMAIL Department of Biostatistics and Epidemiology University of Puerto Rico at Medical Science Campus San Juan, PR 00936-5067, USA Ranjan Maitra EMAIL Department of Statistics Iowa State University Ames, IA, 50011-1090, USA
Pseudocode	Yes	Having provided theoretical development for the machinery that we will use, we now describe our multi-phased KNOB-SynC algorithm:
Open Source Code	Yes	An R package called SynClustR implements our method in the function KNOBSynC and the competing K-mH syncytial algorithm in the function kmH and is publicly available at https://github.com/ialmodovar/SynClustR.
Open Datasets	Yes	The 2D Aggregation dataset of Gionis et al. (2007) has n = 788 observations from C = 7 groups of different characteristics. ... The E. coli dataset, publicly available from the University of California Irvine’s Machine Learning Repository (UCIMLR) (Newman et al., 1998), concerns identification of protein localization sites for the E. coli bacteria (Nakai and Kinehasa, 1991). ... The standard wine recognition dataset (Forina et al., 1988; S. Aeberhard and de Vel, 1992), also available from the UCIMLR contains p = 13 measurements on n = 178 wine samples... The olive oils dataset (Forina and Tiscornia, 1982; Forina et al., 1983) has measurements on 8 chemical components... The image segmentation dataset, also available from the UCIMLR, is on 19 attributes... The yeast protein localization dataset (Nakai, 1996), also obtained from the UCIMLR... The Acute Lymphoblastic Leukemia (ALL) training dataset of Yeoh et al. (2002) was used by Stuetzle and Nugent (2010)... The zipcode images (Stuetzle and Nugent, 2010) dataset consists of n = 2000 16 × 16 images... The Handwritten Pen-digits dataset (Alimoglu, 1996; Alimoglu and Alpaydin, 1996) available at the UCIMLR... We illustrate our methodology on the first 100 images of the Olivetti faces database (Samaria and Harter, 1994)... We illustrate our methodology on a subset (Wagstaff, 2004) of the Sloan Digital Sky Survey (SDSS) dataset...
Dataset Splits	No	The 50 homogeneous k-means groups in each of the five replicates when supplied to the merging phase each terminated with Ĉ = 2 syncytial groups. For the first replicate, the largest group has 178307 (99.4%) voxels; this is essentially the region of no activation. The other replicates have 178898 (99.7%), 178129 (99.3%), 179087 (99.8%), and 178658 (99.6%) voxels in this group.
Hardware Specification	No	No specific hardware details for experimental runs are mentioned. The paper discusses CPU intensity of certain methods but does not specify the hardware used by the authors for their own experiments.
Software Dependencies	Yes	MMC is implemented in the R (R Development Core Team, 2018) package RMixModCombi (Baudry and Celeux, 2014)... DBSCAN as implemented in the R package dbscan (Hahsler and Piekenbrock, 2018)... DP clustering (Rodriguez and Laio, 2014) as implemented in the R package densityClust (Pedersen et al., 2017)... PGMM (McNicholas and Murphy, 2008) using the R package pgmm (McNicholas et al., 2018)... MSAL using the R package MixSAL (Franczak et al., 2018) and MGHD using the R package MixGHD (Tortora et al., 2019)... R package RFASTfMRI (Almodovar-Rivera and Maitra, 2019)...
Experiment Setup	Yes	For each K ∈ {1, 2, . . . , Kmax}, obtain K-means partitions initialized each of nKp times with K distinct seeds randomly chosen from the dataset and run to termination. The best in terms of the value of the objective function (WSS) at termination of each set of nKp runs is our putative optimal K-means partition for that K ∈ {1, 2, . . . , Kmax}. We use Kmax = max{√n, 50}. ... In calculating the jump statistic, we have used y = p/2... The parameter κ determines the types of composite groups that are formed. For larger values of κ, we have groups formed by merging a few pairs at each iteration while smaller values of κ prefer many simultaneous mergers. (For κ = ∞, no merging is possible.) ... For computational reasons also, we do not estimate K0 in the k-means phase but set it at K0 = 50.
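
The k-means phase quoted in the Experiment Setup row (many random restarts per K, keep the partition with the lowest WSS, then pick K with a jump statistic using y = p/2) can be sketched roughly as below. This is an illustrative Python sketch, not the authors' SynClustR code: the function names `lloyd`, `best_kmeans`, and `jump_select` are hypothetical, the jump statistic is written in the standard Sugar and James form as an assumption about what the paper uses, and the demo uses far fewer restarts and a much smaller Kmax than the paper describes.

```python
import numpy as np


def lloyd(X, centers, max_iter=100):
    """Plain Lloyd iterations from the given initial centers.

    Returns (labels, centers, wss), where wss is the within-cluster sum of squares.
    """
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = d2[np.arange(len(X)), labels].sum()
    return labels, centers, wss


def best_kmeans(X, K, n_init, rng):
    """Run n_init restarts, each seeded with K distinct data points; keep the lowest WSS."""
    best = None
    for _ in range(n_init):
        seeds = rng.choice(len(X), size=K, replace=False)
        result = lloyd(X, X[seeds].astype(float))
        if best is None or result[2] < best[2]:
            best = result
    return best


def jump_select(wss_by_K, n, p):
    """Pick K by the largest jump in transformed distortion (assumed Sugar-James form).

    wss_by_K[i] is the best WSS found for K = i + 1; the exponent is y = p / 2.
    Assumes every WSS is strictly positive.
    """
    y = p / 2
    distortion = np.asarray(wss_by_K, dtype=float) / (n * p)
    transformed = np.concatenate([[0.0], distortion ** (-y)])
    jumps = np.diff(transformed)
    return int(jumps.argmax()) + 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: three well-separated 2D blobs (illustrative only).
    X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])])
    n, p = X.shape
    k_max, n_init = 6, 10  # small demo values, not the paper's settings
    wss = [best_kmeans(X, K, n_init, rng)[2] for K in range(1, k_max + 1)]
    print("best WSS per K:", np.round(wss, 2))
    print("K chosen by jump statistic:", jump_select(wss, n, p))
```

The restart count is passed in explicitly so that it can be scaled with the data size, as the quoted setup does; the merging phase governed by κ is not sketched here.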