On Finding Hubs in High Dimensions with Sampling
Authors: Huiwen Dong, Linghan Zeng, Zhiwen Zhao, Francesco Silvestri, Ninh Pham
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, by sampling only 10% of points, SamHub runs significantly faster and offers higher accuracy than existing hub detection methods on many real-world data sets with dot product, L1, L2, and dynamic time warping distances. Our ablation studies of SamHub on improving kNN-based classification show potential for other high-dimensional data analysis tasks. |
| Researcher Affiliation | Academia | Huiwen Dong (1), Linghan Zeng (2), Zhiwen Zhao (1), Francesco Silvestri (3), Ninh Pham (2); (1) Faculty of Artificial Intelligence, Beijing Normal University, Guangdong, China; (2) School of Computer Science, University of Auckland, New Zealand; (3) Department of Information Engineering, University of Padova, Italy |
| Pseudocode | Yes | Algorithm 1: SamHub. Require: Data set X ⊂ R^d of n points, s samples, k nearest neighbors, top-h hubs, a distance measure. Filtering step: 1: Uniformly sample s points from X to form S. 2: For each y_i ∈ S, compute the kNN list kNN(y_i, X). 3: Form the histogram H1 where the counter of x_j is the frequency of x_j on the s kNN lists in Line 2. 4: Select the top-s points with the highest frequency in H1 to form the candidate set C. Verifying step: 5: For each x_j ∈ X, compute the kNN list kNN(x_j, C). 6: Form the histogram H2 where the counter of x_j is the frequency of x_j in the n kNN lists in Line 5. 7: Return the top-h points with the highest frequency as the top-h hubs and compute the estimated skewness based on H2. Ensure: The top-h hubs and the estimated skewness score. |
| Open Source Code | Yes | We implement SamHub with a few lines of Python code.¹ ¹https://github.com/AnnieDong-code/SamHub.git |
| Open Datasets | Yes | We conduct experiments on many real-world data sets in Table 1. |
| Dataset Splits | No | The paper mentions running experiments on real-world datasets and states that "All results are the average of 5 runs of the algorithms." However, it does not specify the train/test/validation splits or methodology used for these datasets. |
| Hardware Specification | Yes | We conduct experiments on a 3.2 GHz Core i7-5800 CPU with 64GB of RAM. |
| Software Dependencies | No | The paper mentions implementing SamHub with Python code and using ANNS libraries like Faiss, Hnswlib, and ScaNN. However, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For all experiments, we set h = 10 and vary k and s. Figure 2 shows the accuracy of SamHub on many real-world data sets with various skewness while varying the number of samples s ∈ {2%, ..., 20%} of n. The CNAE, Mini-ngs, Gisette, and Optdigits data sets exhibit skewness of 5.90, 8.45, 4.66, and 1.02 on L2. |
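The two-step pseudocode in the table (filter with s sampled queries, then verify every point against the candidate set C) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes L2 distance computed by brute force (the paper also supports dot product, L1, and DTW, and can use ANNS libraries instead), and the function name `sam_hub` and the moment-based skewness estimate on H2 are our own choices for the sketch.

```python
import numpy as np

def sam_hub(X, s, k, h, rng=None):
    """Sketch of SamHub's filtering + verifying steps (L2 distance assumed).

    X: (n, d) data matrix; s: sample/candidate size; k: kNN list size;
    h: number of hubs to return. Requires k < s <= n.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Filtering step: sample s query points uniformly and build histogram H1,
    # counting how often each point of X appears on the s kNN lists.
    sample_idx = rng.choice(n, size=s, replace=False)
    h1 = np.zeros(n, dtype=int)
    for i in sample_idx:
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        d2[i] = np.inf                        # exclude the query itself
        knn = np.argpartition(d2, k)[:k]      # k nearest neighbors in X
        h1[knn] += 1
    # Top-s most frequent points form the candidate set C.
    C = np.argpartition(h1, -s)[-s:]

    # Verifying step: for every point of X, find its kNN within C only,
    # building histogram H2 over the candidates.
    h2 = np.zeros(n, dtype=int)
    for j in range(n):
        d2 = np.sum((X[C] - X[j]) ** 2, axis=1)
        d2[C == j] = np.inf                   # exclude self if j is a candidate
        knn = np.argpartition(d2, k)[:k]
        h2[C[knn]] += 1

    hubs = np.argsort(h2)[::-1][:h]           # top-h hubs by verified frequency
    # Estimated skewness of the kNN-occurrence distribution based on H2.
    skewness = float(((h2 - h2.mean()) ** 3).mean() / (h2.std() ** 3 + 1e-12))
    return hubs, skewness
```

Setting s to roughly 10% of n mirrors the paper's reported sweet spot; the filtering loop costs O(snd) brute force, versus O(n²d) for exact hub detection.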