Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Low-Rank Doubly Stochastic Matrix Decomposition for Cluster Analysis

Authors: Zhirong Yang, Jukka Corander, Erkki Oja

JMLR 2016 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The new method is compared against many other clustering methods on various real-world data sets. The results show that the DCD method can often produce more accurate clusterings, especially for large-scale manifold data sets containing up to hundreds of thousands of samples. We also demonstrate that it can select the number of clusters much more precisely than other existing approaches. ... Experimental results showed that our method works robustly for various selected data sets and can substantially improve clustering accuracy for large manifold data sets.
Researcher Affiliation | Academia | Zhirong Yang (EMAIL), Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland; Jukka Corander (EMAIL), Department of Mathematics and Statistics, Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland, and Department of Biostatistics, University of Oslo, Norway; Erkki Oja (EMAIL), Department of Computer Science, Aalto University, Finland
Pseudocode | Yes | Algorithm 1: Relaxed MM Algorithm for DCD
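The decomposition behind Algorithm 1 can be illustrated in a few lines. The sketch below is our reading of the paper's model, not its code: assuming a nonnegative membership matrix W whose rows sum to 1, the low-rank approximation M = W diag(1ᵀW)⁻¹ Wᵀ is symmetric, nonnegative, and doubly stochastic.

```python
import numpy as np

def dcd_approximation(W):
    """Low-rank doubly stochastic approximation M = sum_k w_k w_k^T / (1^T w_k).

    W is an (n, r) nonnegative matrix whose rows sum to 1 (soft cluster
    memberships); the resulting M is symmetric, nonnegative, and doubly
    stochastic. This reflects our reading of the DCD model, not the
    authors' implementation.
    """
    s = W.sum(axis=0)        # column sums 1^T w_k, one scalar per cluster
    return (W / s) @ W.T     # equals W @ diag(1/s) @ W.T

rng = np.random.default_rng(0)
W = rng.random((6, 3))
W /= W.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
M = dcd_approximation(W)           # 6 x 6, symmetric, rows and columns sum to 1
```

Double stochasticity follows directly: M @ 1 = W diag(1/s) (Wᵀ 1) = W @ 1 = 1, and M is symmetric by construction.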
Open Source Code | No | The paper mentions using modified code for a vantage-point index and refers to FLANN (Fast Library for Approximate Nearest Neighbors) as a tool for KNN search. For the vantage-point index, it states: "We have modified and used the code in http://stevehanov.ca/blog/index.php?id=130." This link refers to a component used in their work, not the source code for the DCD methodology described in the paper. There is no explicit statement or link providing the source code for the DCD method itself.
Open Datasets | Yes | We have compared the above methods on 43 data sets from various domains, including biology, image, video, text, remote sensing, etc. All data sets are publicly available on the Internet. The data sources and statistics are given in the supplemental document. ... For the text document data set 20NG, DCD achieves comparable accuracy to those with comprehensive feature engineering and supervised classification (e.g. Srivastava et al., 2013) ... MNIST (see http://yann.lecun.com/exdb/mnist/), though our method does not use any class labels.
Dataset Splits | No | The paper describes using K-Nearest-Neighbor graphs constructed from multivariate data for similarity-based clustering, and outlines how the similarity matrix S is obtained for the clustering tasks (e.g., K = 10 NN graphs). However, it does not specify explicit training/test/validation splits, proportions, or random seeds; clustering methods typically operate on the full data set and are evaluated against ground-truth labels.
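The K-NN graph construction referenced above is the main data-preparation step. A minimal sketch follows, assuming a Euclidean metric and binary edge weights with mutual symmetrization (the paper fixes K = 10; the metric and weighting here are our assumptions, and the paper uses a vantage-point index or FLANN for speed rather than this brute-force distance matrix):

```python
import numpy as np

def knn_similarity(X, K=10):
    """Symmetrized K-nearest-neighbor graph as a binary similarity matrix S.

    Brute-force O(n^2) construction for illustration only; the Euclidean
    metric and binary edge weights are assumptions, not the paper's spec.
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    idx = np.argsort(D, axis=1)[:, :K]   # K nearest neighbors per row
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), K), idx.ravel()] = 1.0
    return np.maximum(S, S.T)            # keep an edge if either end selects it

X = np.random.default_rng(1).random((20, 2))
S = knn_similarity(X, K=3)               # symmetric, zero diagonal
```

Because the whole graph serves as the clustering input, there is nothing to split; this matches the report's "No" for dataset splits.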
Hardware Specification | No | "We acknowledge the computational resources provided by the Aalto Science-IT project." This statement identifies the source of computational resources but gives no specific details about the hardware used (e.g., GPU models, CPU types, memory).
Software Dependencies | No | We used their implementation in Matlab. ... We have used a simple implementation with a vantage-point index (Yianilos, 1993), where we slightly modified the code to admit sparse data and with interface to Matlab. ... Fast Library of Approximated Nearest Neighbors (FLANN; Muja and Lowe, 2014). The paper mentions using Matlab and the FLANN library, but it does not give version numbers for these components, which are needed for reproducible software dependencies.
Experiment Setup | Yes | In our implementation, we add a small positive perturbation (e.g. 0.2) to all entries of the initial cluster indicator matrix. Next, the perturbed matrix is fed to our optimization algorithm (with α = 1 in Algorithm 1). Among all runs of DCD, we return the clustering result with the smallest D(S||M). ... The NMF-type methods were run with a maximum of 10,000 iterations of multiplicative updates and with convergence tolerance 10^-6. ... For similarity-based clustering methods, we constructed K-Nearest-Neighbor graphs from the multivariate data with K = 10. ... By simply replacing 10NN with 5NN as the input similarities, DCD respectively selects 21 for COIL20 and 99 for COIL100 as the best number of clusters.
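The restart-selection criterion D(S||M) quoted above can be sketched as follows. We assume the generalized KL (I-)divergence form, which is standard for NMF-type objectives; the exact form the authors use is stated only as D(S||M) in the excerpt, so treat this as an illustration:

```python
import numpy as np

def kl_div(S, M, eps=1e-12):
    """Generalized KL divergence D(S||M) = sum_ij S log(S/M) - S + M.

    Candidate criterion for picking the best of several restarts; the
    generalized (I-divergence) form is our assumption about D(S||M).
    """
    S = np.asarray(S, dtype=float)
    M = np.asarray(M, dtype=float)
    mask = S > 0                         # 0 * log 0 is taken as 0
    return float((S[mask] * np.log(S[mask] / (M[mask] + eps))).sum()
                 - S.sum() + M.sum())

S = np.array([[0.5, 0.5], [0.5, 0.5]])
M = np.array([[0.6, 0.4], [0.4, 0.6]])
# kl_div(S, S) is 0, and kl_div(S, M) is positive: a restart whose
# approximation matches S more closely scores lower, so one would keep
# the run minimizing this value, e.g. min(runs, key=lambda W: kl_div(S, approx(W)))
# where approx() stands in for whatever builds M from the factor W.
```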