On Finding Hubs in High Dimensions with Sampling
Authors: Huiwen Dong, Linghan Zeng, Zhiwen Zhao, Francesco Silvestri, Ninh Pham
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, by sampling only 10% of points, SamHub runs significantly faster and offers higher accuracy than existing hub detection methods on many real-world data sets with dot product, L1, L2, and dynamic time warping distances. Our ablation studies of SamHub on improving kNN-based classification show potential for other high-dimensional data analysis tasks. |
| Researcher Affiliation | Academia | Huiwen Dong (1), Linghan Zeng (2), Zhiwen Zhao (1), Francesco Silvestri (3), Ninh Pham (2); (1) Faculty of Artificial Intelligence, Beijing Normal University, Guangdong, China; (2) School of Computer Science, University of Auckland, New Zealand; (3) Department of Information Engineering, University of Padova, Italy |
| Pseudocode | Yes | Algorithm 1: SamHub. Require: Data set X ⊂ R^d of n points, s samples, k nearest neighbors, top-h hubs, a distance measure. Filtering step: 1: Uniformly sample s points from X to form S. 2: For each y_i ∈ S, compute the kNN list kNN(y_i, X). 3: Form the histogram H1 where the counter of x_j is the frequency of x_j on the s kNN lists in Line 2. 4: Select the top-s points with the highest frequency in H1 to form the candidate set C. Verifying step: 5: For each x_j ∈ X, compute the kNN list kNN(x_j, C). 6: Form the histogram H2 where the counter of x_j is the frequency of x_j in the n kNN lists in Line 5. 7: Return the top-h points with the highest frequency as the top-h hubs and compute the estimated skewness based on H2. Ensure: The top-h hubs and the estimated skewness score. |
| Open Source Code | Yes | We implement SamHub with a few lines of Python code.¹ ¹https://github.com/AnnieDong-code/SamHub.git |
| Open Datasets | Yes | We conduct experiments on many real-world data sets in Table 1. |
| Dataset Splits | No | The paper mentions running experiments on real-world datasets and states that "All results are the average of 5 runs of the algorithms." However, it does not specify the train/test/validation splits or methodology used for these datasets. |
| Hardware Specification | Yes | We conduct experiments on a 3.2 GHz Core i7-5800 CPU with 64GB of RAM. |
| Software Dependencies | No | The paper mentions implementing SamHub with Python code and using ANNS libraries like Faiss, Hnswlib, and ScaNN. However, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For all experiments, we set h = 10 and vary k and s. Figure 2 shows the accuracy of SamHub on many real-world data sets with various skewness while varying the number of samples s ∈ {2%, ..., 20%} of n. The CNAE, Mini-ngs, Gisette, and Optdigits data sets exhibit skewness of 5.90, 8.45, 4.66, and 1.02 on L2. |
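The two-step pseudocode in the table (filter with s sampled queries, then verify every point against the candidate set C) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes L2 distance computed by brute force (the paper also supports dot product, L1, and DTW, and can use ANNS libraries instead), and the function name `sam_hub` and the moment-based skewness estimate on H2 are our own choices for the sketch.

```python
import numpy as np

def sam_hub(X, s, k, h, rng=None):
    """Sketch of SamHub's filtering + verifying steps (L2 distance assumed).

    X: (n, d) data matrix; s: sample/candidate size; k: kNN list size;
    h: number of hubs to return. Requires k < s <= n.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Filtering step: sample s query points uniformly and build histogram H1,
    # counting how often each point of X appears on the s kNN lists.
    sample_idx = rng.choice(n, size=s, replace=False)
    h1 = np.zeros(n, dtype=int)
    for i in sample_idx:
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        d2[i] = np.inf                        # exclude the query itself
        knn = np.argpartition(d2, k)[:k]      # k nearest neighbors in X
        h1[knn] += 1
    # Top-s most frequent points form the candidate set C.
    C = np.argpartition(h1, -s)[-s:]

    # Verifying step: for every point of X, find its kNN within C only,
    # building histogram H2 over the candidates.
    h2 = np.zeros(n, dtype=int)
    for j in range(n):
        d2 = np.sum((X[C] - X[j]) ** 2, axis=1)
        d2[C == j] = np.inf                   # exclude self if j is a candidate
        knn = np.argpartition(d2, k)[:k]
        h2[C[knn]] += 1

    hubs = np.argsort(h2)[::-1][:h]           # top-h hubs by verified frequency
    # Estimated skewness of the kNN-occurrence distribution based on H2.
    skewness = float(((h2 - h2.mean()) ** 3).mean() / (h2.std() ** 3 + 1e-12))
    return hubs, skewness
```

Setting s to roughly 10% of n mirrors the paper's reported sweet spot; the filtering loop costs O(snd) brute force, versus O(n²d) for exact hub detection.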