Dirichlet Process-Based Robust Clustering Using the Median-of-Means Estimator

Authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Statistical guarantees on an upper bound of clustering error and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
Researcher Affiliation | Academia | (1) Department of Statistical Science, Duke University, USA; (2) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA; (3) Machine Learning Research Laboratory, ECSU, Indian Statistical Institute, Kolkata, India; (4) Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India
Pseudocode | Yes | Algorithm 1: Dirichlet Process Clustering using Median-of-Means (DP-MoM)
Open Source Code | Yes | Codes can be found at https://github.com/jyotishkarc/DP-MoM.
Open Datasets | Yes | Our first experiment involves implementing the aforementioned techniques on several datasets from the UCI Machine Learning Repository and the Compcancer database.
Dataset Splits | No | The paper describes simulation studies where data points are generated and outliers are introduced in stages, and mentions running the randomized algorithm 35 times. However, it does not provide specific training/test/validation dataset splits with percentages, absolute counts, or references to predefined standard splits for reproduction of the experiments.
Hardware Specification | Yes | The simulation experiments were conducted using a computer equipped with Intel(R) Core(TM) i3-7020U 2.30GHz processor, 4GB RAM, 64-bit Windows 10 operating system in the R programming language [R Core Team, 2022].
Software Dependencies | Yes | The simulation experiments were conducted using a computer equipped with Intel(R) Core(TM) i3-7020U 2.30GHz processor, 4GB RAM, 64-bit Windows 10 operating system in the R programming language [R Core Team, 2022].
Experiment Setup | Yes | The tuning parameter ε is set to 1. The learning rate η is typically chosen to be the power of 10 which is of the order of the squared maximum pairwise distance in the dataset, or one lower than that, i.e., if the maximum squared separation between any two observations in the data is D, then we set $\eta = 10^{\lceil \log_{10} D/2 \rceil}$ or $10^{\lceil \log_{10} D/2 \rceil - 1}$ depending on which of these values aids efficient clustering using our proposed method, where $\lceil \cdot \rceil$ represents the ceiling function.
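
The learning-rate rule quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration, not the authors' code (their experiments were run in R). The function name `eta_candidates` is hypothetical, and the exponent is read here as ceil(log10(D) / 2), which is an assumption: the extracted formula does not make clear whether the division by 2 applies to D or to the logarithm.

```python
import math

import numpy as np


def eta_candidates(X):
    """Return the two candidate learning rates for a data matrix X (rows = observations).

    Hypothetical helper, not taken from the DP-MoM repository.
    """
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distance between every pair of rows (O(n^2 * p) memory).
    diffs = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    D = sq_dists.max()  # maximum squared separation between two observations
    if D <= 0:
        raise ValueError("need at least two distinct observations")
    # ASSUMPTION: the exponent is ceil(log10(D) / 2); the source formula is ambiguous.
    k = math.ceil(math.log10(D) / 2)
    # The paper picks whichever of the two values clusters better on the data at hand.
    return 10.0 ** k, 10.0 ** (k - 1)
```

For two points at distance 5 (so D = 25), this reading gives the candidates 10 and 1. Which of the two is used is decided empirically, as the quoted text states.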