Cluster Tree for Nearest Neighbor Search
Authors: Dan Kushnir, Sandeep Silwal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations on real-world datasets demonstrate improvements over RP trees and other tree-based methods for NNS while maintaining efficient construction time. In addition, we show theoretically and empirically that Cluster Tree finds partitions that are superior to those found by RP trees in preserving the cluster structure of the input dataset. |
| Researcher Affiliation | Collaboration | Dan Kushnir, Bell Laboratories (Nokia), Murray Hill, NJ; Sandeep Silwal, Department of Computer Science, University of Wisconsin-Madison |
| Pseudocode | Yes | Algorithm 1: OneDProjection(X, T, k); Algorithm 2: MakeClusterTree(X, P, T, k); Algorithm 3: Query(q, T) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the Cluster Tree methodology. Mentions of code for third-party tools (like FALCONN for LSH) are present, but not for the authors' own work. |
| Open Datasets | Yes | We use the following datasets, which have been used in previous machine learning works on clustering and nearest neighbor search (for example the works of Dong et al. (2020); Keivani & Sinha (2021); Lucic et al. (2018); and Bachem et al. (2018)): KDD Cup (clustering dataset from a quantum physics task) (kdd, 2004), News (dataset of news text where each feature represents whether a key word is included) (Rennie, 2016), Spam (spam text where each feature represents the presence of a particular word associated with spam) (van Rijn, 2016), SIFT (image descriptors) (Aumüller et al., 2017), and Gaussian Mixtures. See Table 1. Due to space limitations we provide experiments with additional datasets in Appendices E and F. |
| Dataset Splits | No | The paper describes how queries are evaluated against the constructed trees (e.g., 'fraction of actual k-nearest neighbors among the returned candidates') and discusses parameters like 'leaf parameter' and 'candidate sizes'. However, it does not explicitly provide details about how the original datasets were split into training, validation, or test sets (e.g., specific percentages or sample counts for building the tree vs. querying it). |
| Hardware Specification | No | The paper mentions runtime in seconds for tree construction on a dense dataset (SIFT) but does not specify any particular hardware components, such as CPU or GPU models, used for the experiments. It only states: 'Even on a dense dataset (SIFT) of 10^6 points in dimension 128 with the leaf parameter set to 5 × 10^3, the runtime to create an RP tree was 11.8 seconds on average whereas Cluster Tree took 45.2 seconds.' |
| Software Dependencies | No | The paper mentions using 'the classic k-means algorithm' and 'the Cross-Polytope version from Andoni et al. (2015; 2018a)' for baselines, but it does not specify any programming languages, libraries, or solvers with their corresponding version numbers used in their implementation or experiments. |
| Experiment Setup | Yes | Parameter Selection. In all of our experiments, we use a fixed value of T = 20 random projections in Algorithm 1. For the value of k, we initialize k = 20 and keep increasing k by one until the value of the normalized cut found stops decreasing. Intuitively, this ensures we don't overlook a potentially better cut. Empirically, we observed that this only iterates over a few values (about 5) of k. Evaluation Metric. As in other NNS works, we measure the number of candidates returned for queries versus the k-NN accuracy, which is defined to be the fraction of its actual k-nearest neighbors that are among the returned candidates (Dong et al., 2020). We created instances of Cluster Tree and RP trees for all of our datasets, where we set the leaf size, the parameter P in Algorithm 2, to be equal to 10% of n. |
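The parameter-selection rule and the evaluation metric quoted in the Experiment Setup row are simple enough to sketch. The following is a minimal Python illustration, not the authors' implementation: `normalized_cut` is a hypothetical callable standing in for whatever cut value the tree-construction step reports, and the neighbor lists are made-up examples.

```python
def knn_accuracy(true_neighbors, candidates):
    """k-NN accuracy as defined in the setup: the fraction of a query's
    actual k nearest neighbors that appear among the returned candidates."""
    true_set = set(true_neighbors)
    return len(true_set & set(candidates)) / len(true_set)

def select_k(normalized_cut, k_init=20):
    """Parameter-selection rule from the setup: start at k = 20 and keep
    increasing k by one until the normalized cut found stops decreasing.
    `normalized_cut(k)` is a hypothetical callable returning the best
    normalized-cut value found when using k centers."""
    k, best = k_init, normalized_cut(k_init)
    while True:
        nxt = normalized_cut(k + 1)
        if nxt >= best:  # cut stopped improving; keep the current k
            return k
        k, best = k + 1, nxt
```

For example, if 4 of a query's 5 true nearest neighbors appear among the candidates a tree returns, `knn_accuracy` reports 0.8; and `select_k` stops at the first k whose successor does not lower the cut, matching the "only a few values of k" behavior described above.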