Robust Hierarchical Clustering

Authors: Maria-Florina Balcan, Yingyu Liang, Pramod Gupta

JMLR 2014

Reproducibility assessment (variable, result, LLM response):
Research Type: Experimental. "Experimental evaluations on synthetic and real-world data sets show that our algorithm achieves better performance than other hierarchical algorithms in the presence of noise." Keywords: unsupervised learning, clustering, agglomerative algorithms, robustness.
Researcher Affiliation: Collaboration. Maria-Florina Balcan, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Yingyu Liang, Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Pramod Gupta, Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA.
Pseudocode: Yes. The paper includes Algorithm 1 (Robust Median Neighborhood Linkage), Algorithm 2 (Inductive Robust Median Neighborhood Linkage), and Algorithm 3 (Implementation Details of Robust Median Neighborhood Linkage).
Open Source Code: No. The paper contains no explicit statement about releasing code and no link to a code repository for the described methodology.
Open Datasets: Yes. Quoted evidence: "To emphasize the effect of noise on different algorithms, we perform controlled experiments on a synthetic data set AIStat. This data set contains 512 points. It is an instance of the example discussed in Section 4 and is described in Figure 7. We further consider the following real-world data sets from UCI Repository (Bache and Lichman, 2013): Wine, Iris, BCW (Breast Cancer Wisconsin), BCWD (Breast Cancer Wisconsin Diagnostic), Spambase, and Mushroom. We also consider the MNIST data set (Le Cun et al., 1998) and use two subsets of the test set for our experiments: Digits0123, which contains the examples of the digits 0, 1, 2, 3, and Digits4567, which contains the examples of the digits 4, 5, 6, 7. We additionally consider the 10 data sets (PFAM1 to PFAM10) (Voevodski et al., 2012), which are created by randomly choosing 8 families (of size between 1000 and 10000) from the biology database Pfam (Punta et al., 2012), version 24.0, October 2009."
Dataset Splits: No. The paper mentions using subsets of the MNIST test set (Digits0123, Digits4567) and repeating the sampling 5 times in the inductive setting, but it does not specify concrete train/validation/test splits, percentages, or references to predefined splits needed for general experimental reproduction.
Hardware Specification: No. The paper does not provide any specific hardware details, such as GPU models, CPU models, or memory specifications, used for running the experiments.
Software Dependencies: No. The paper mentions the "biological sequence alignment software BLAST (Altschul et al., 1990)" but does not specify version numbers for any software used to implement or run the algorithms.
Experiment Setup: Yes. Quoted evidence: "Input: Similarity function K on a set of points S, n = |S|, noise parameters α > 0, ν > 0. Step 1: Initialize t = 6(α + ν)n + 1. Initialize C_1^t to be a list of blobs so that each point is in its own blob." "For all algorithms accepting input parameters (including (Generalized) Wishart's Method, CURE, and RMNL), the experiments are repeated on the same data over a range of input parameter values, and the best results are considered." "(a) ... The x-axis represents the value of α; (b) Fix α = 1/32 and vary ν from 0 to 1/32. The x-axis represents the value of ν; (c) Vary α from 1/32 to 1/16, and vary ν from 0 to 1/32. The x-axis represents the value of α + ν."
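The initialization step quoted above can be sketched in code. This is a minimal sketch under stated assumptions: only the threshold t = 6(α + ν)n + 1 and the singleton-blob initialization are taken from the paper's quoted setup; the function name `rmnl_init` is hypothetical, and the subsequent merge steps of the algorithm are not reproduced here.

```python
def rmnl_init(S, alpha, nu):
    """Initialization step of Robust Median Neighborhood Linkage (RMNL).

    Sets the threshold t = 6(alpha + nu) * n + 1 and places each point
    in its own blob, as in Step 1 of Algorithm 1 of the paper. The
    merging phase of the algorithm is not reproduced here.
    """
    n = len(S)
    t = 6 * (alpha + nu) * n + 1   # noise-dependent threshold
    blobs = [[x] for x in S]       # each point starts as its own blob
    return t, blobs
```

For example, on the synthetic AIStat set of 512 points with α = ν = 1/32 (values used in the paper's controlled experiments), the threshold works out to t = 6 · (1/16) · 512 + 1 = 193.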