reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dirichlet Process-Based Robust Clustering Using the Median-of-Means Estimator

Authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das

IJCAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Statistical guarantees on an upper bound of clustering error and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
Researcher Affiliation	Academia	(1) Department of Statistical Science, Duke University, USA; (2) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA; (3) Machine Learning Research Laboratory, ECSU, Indian Statistical Institute, Kolkata, India; (4) Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India
Pseudocode	Yes	Algorithm 1: Dirichlet Process Clustering using Median-of-Means (DP-MoM)
Open Source Code	Yes	Codes can be found at https://github.com/jyotishkarc/DP-MoM.
Open Datasets	Yes	Our first experiment involves implementing the aforementioned techniques on several datasets from the UCI Machine Learning Repository and the Compcancer database.
Dataset Splits	No	The paper describes simulation studies where data points are generated and outliers are introduced in stages, and mentions running the randomized algorithm 35 times. However, it does not provide specific training/test/validation dataset splits with percentages, absolute counts, or references to predefined standard splits for reproduction of the experiments.
Hardware Specification	Yes	The simulation experiments were conducted using a computer equipped with Intel(R) Core(TM) i3-7020U 2.30GHz processor, 4GB RAM, 64-bit Windows 10 operating system in the R programming language [R Core Team, 2022].
Software Dependencies	Yes	The simulation experiments were conducted using a computer equipped with Intel(R) Core(TM) i3-7020U 2.30GHz processor, 4GB RAM, 64-bit Windows 10 operating system in the R programming language [R Core Team, 2022].
Experiment Setup	Yes	The tuning parameter ε is set to 1. The learning rate η is typically chosen to be the power of 10 which is of the order of the squared maximum pairwise distance in the dataset, or one lower than that, i.e., if the maximum squared separation between any two observations in the data is D, then we set $\eta = 10^{\lceil \log_{10} D/2 \rceil}$ or $10^{\lceil \log_{10} D/2 \rceil - 1}$, depending on which of these values aids efficient clustering using our proposed method, where $\lceil \cdot \rceil$ represents the ceiling function.
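
The learning-rate heuristic quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not taken from the paper or its repository (the authors report working in R), and the function names `max_squared_distance` and `candidate_learning_rates` are hypothetical. It also assumes one particular reading of the quoted formula, namely that the exponent is the ceiling of (log10 D)/2; if the paper intends log10(D/2) instead, the exponent line would need to change accordingly.

```python
import math
from itertools import combinations


def max_squared_distance(points):
    """Largest squared Euclidean distance D between any two observations."""
    return max(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )


def candidate_learning_rates(points):
    """Return the two candidate values of eta described in the quoted text.

    Assumption: the exponent is ceil(log10(D) / 2). This is one reading of
    the formula, not a confirmed transcription of the paper.
    """
    D = max_squared_distance(points)
    if D <= 0:
        raise ValueError("need at least two distinct observations")
    exponent = math.ceil(math.log10(D) / 2)
    # The text says one of these two is picked, whichever clusters better.
    return 10.0 ** exponent, 10.0 ** (exponent - 1)


if __name__ == "__main__":
    toy_data = [(0.0, 0.0), (30.0, 40.0), (3.0, 4.0)]
    print(candidate_learning_rates(toy_data))
```

For the toy data above, D = 2500, so under this reading the two candidates are 100 and 10. The choice between them would still be made empirically, as the quoted text describes.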