Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering
Authors: Israel A. Almodóvar-Rivera, Ranjan Maitra
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets and can be applied to datasets with scatter or incomplete records. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study. Keywords: BATSE, DEMP, DEMP+, DBSCAN*, density peaks algorithm, GRB, GSLNN, k-clips, k-means, km-means, kernel density estimation, KNOB-SynC, MixModCombi, MGHD, MSAL, overlap, PGMM, SDSS, spectral clustering, TiK-means |
| Researcher Affiliation | Academia | Israel A. Almodóvar-Rivera, Department of Biostatistics and Epidemiology, University of Puerto Rico at Medical Science Campus, San Juan, PR 00936-5067, USA; Ranjan Maitra, Department of Statistics, Iowa State University, Ames, IA 50011-1090, USA |
| Pseudocode | Yes | Having provided theoretical development for the machinery that we will use, we now describe our multi-phased KNOB-SynC algorithm: |
| Open Source Code | Yes | An R package called SynClustR implements our method in the function KNOBSynC and the competing K-mH syncytial algorithm in the function kmH and is publicly available at https://github.com/ialmodovar/SynClustR. |
| Open Datasets | Yes | The 2D Aggregation dataset of Gionis et al. (2007) has n = 788 observations from C = 7 groups of different characteristics. ... The E. coli dataset, publicly available from the University of California Irvine’s Machine Learning Repository (UCIMLR) (Newman et al., 1998), concerns identification of protein localization sites for the E. coli bacteria (Nakai and Kanehisa, 1991). ... The standard wine recognition dataset (Forina et al., 1988; S. Aeberhard and de Vel, 1992), also available from the UCIMLR, contains p = 13 measurements on n = 178 wine samples... The olive oils dataset (Forina and Tiscornia, 1982; Forina et al., 1983) has measurements on 8 chemical components... The image segmentation dataset, also available from the UCIMLR, is on 19 attributes... The yeast protein localization dataset (Nakai, 1996), also obtained from the UCIMLR... The Acute Lymphoblastic Leukemia (ALL) training dataset of Yeoh et al. (2002) was used by Stuetzle and Nugent (2010)... The zipcode images (Stuetzle and Nugent, 2010) dataset consists of n = 2000 16 × 16 images... The Handwritten Pen-digits dataset (Alimoglu, 1996; Alimoglu and Alpaydin, 1996) available at the UCIMLR... We illustrate our methodology on the first 100 images of the Olivetti faces database (Samaria and Harter, 1994)... We illustrate our methodology on a subset (Wagstaff, 2004) of the Sloan Digital Sky Survey (SDSS) dataset... |
| Dataset Splits | No | The 50 homogeneous k-means groups in each of the five replicates, when supplied to the merging phase, each terminated with Ĉ = 2 syncytial groups. For the first replicate, the largest group has 178307 (99.4%) voxels; this is essentially the region of no activation. The other replicates have 178898 (99.7%), 178129 (99.3%), 179087 (99.8%), and 178658 (99.6%) voxels in this group. |
| Hardware Specification | No | No specific hardware details for experimental runs are mentioned. The paper discusses CPU intensity of certain methods but does not specify the hardware used by the authors for their own experiments. |
| Software Dependencies | Yes | MMC is implemented in the R (R Development Core Team, 2018) package RMixModCombi (Baudry and Celeux, 2014)... DBSCAN as implemented in the R package dbscan (Hahsler and Piekenbrock, 2018)... DP clustering (Rodriguez and Laio, 2014) as implemented in the R package densityClust (Pedersen et al., 2017)... PGMM (McNicholas and Murphy, 2008) using the R package pgmm (McNicholas et al., 2018)... MSAL using the R package MixSAL (Franczak et al., 2018) and MGHD using the R package MixGHD (Tortora et al., 2019)... R package RFASTfMRI (Almodovar-Rivera and Maitra, 2019)... |
| Experiment Setup | Yes | For each K ∈ {1, 2, ..., Kmax}, obtain K-means partitions initialized each of nKp times with K distinct seeds randomly chosen from the dataset and run to termination. The best in terms of the value of the objective function (WSS) at termination of each set of nKp runs is our putative optimal K-means partition for that K ∈ {1, 2, ..., Kmax}. We use Kmax = max{√n, 50}. ... In calculating the jump statistic, we have used y = p/2... The parameter κ determines the types of composite groups that are formed. For larger values of κ, we have groups formed by merging a few pairs at each iteration while smaller values of κ prefer many simultaneous mergers. (For κ = ∞, no merging is possible.) ... For computational reasons also, we do not estimate K0 in the k-means phase but set it at K0 = 50. |
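
The multi-start K-means step quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation (their code is the R package cited in the table). It is a pure-Python illustration that assumes plain Lloyd iterations and Euclidean distance. The function names (`lloyd`, `best_kmeans`, `kmeans_path`) and the parameters `n_starts`, `k_max`, `max_iter`, and `seed` are hypothetical and chosen here for illustration only.

```python
import random


def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, c))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd(points, centers, max_iter=100):
    """Run Lloyd iterations from the given centers until labels stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        centers = new_centers
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    # Within-cluster sum of squares (WSS) at termination.
    wss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, wss


def best_kmeans(points, k, n_starts, rng):
    """Restart K-means n_starts times; keep the run with the lowest terminal WSS."""
    best = None
    for _ in range(n_starts):
        seeds = rng.sample(points, k)  # k distinct observations used as seeds
        result = lloyd(points, seeds)
        if best is None or result[2] < best[2]:
            best = result
    return best


def kmeans_path(points, k_max, n_starts, seed=0):
    """Putative best partition for every K in 1..k_max (illustrative only)."""
    rng = random.Random(seed)
    return {k: best_kmeans(points, k, n_starts, rng)
            for k in range(1, k_max + 1)}


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for k, (_, labels, wss) in kmeans_path(toy, k_max=3, n_starts=5).items():
        print(k, labels, round(wss, 4))
```

Each value in the returned dictionary is a `(centers, labels, wss)` triple, so the WSS values across K can be passed to whatever model-selection criterion is in use. The number of restarts and the upper limit on K are left as arguments rather than fixed to the settings quoted in the table.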