Graph-Based Algorithms for Diverse Similarity Search

Authors: Piyush Anand, Piotr Indyk, Ravishankar Krishnaswamy, Sepideh Mahabadi, Vikas C. Raykar, Kirankumar Shiragur, Haike Xu

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that the search time of our algorithms is substantially lower than that using the standard two-stage approach. ... We provide an empirical evaluation of our methods. ... We run experiments using our heuristics on several different datasets, both real-world as well as synthetically generated.
Researcher Affiliation Collaboration 1Microsoft 2MIT 3Microsoft Research. Correspondence to: Sepideh Mahabadi <EMAIL>, Haike Xu <EMAIL>.
Pseudocode Yes Algorithm 1 Indexing algorithm for colorful NN ... Algorithm 2 Search algorithm for colorful NN ... Algorithm 3 Indexing algorithm for diverse NN ... Algorithm 4 Search algorithm for primal diverse NN ... Algorithm 5 Search algorithm for dual diverse NN ... Algorithm 6 Insert (p, d, c) into Diverse Priority Queue (Q, L, k ) ... Algorithm 7 Diverse Search(G, s, q, k , k, L) ... Algorithm 8 Diverse Prune(p, V, α, R, m) ... Algorithm 9 Diverse Index(P, α, L, R, m) ... Algorithm 10 Robust Pruning(i, U, α, R) ... Algorithm 11 Greedy Search(s, q, L) ... Algorithm 12 Disk ANN indexing algorithm (with fast preprocessing) ... Algorithm 13 Disk ANN indexing algorithm (with slow preprocessing)
Open Source Code Yes The code is available at https://github.com/microsoft/Disk ANN/tree/diversity
Open Datasets Yes Real-world Seller Dataset ... Amazon Automotive Dataset: Our second real-world dataset is derived from the recently released Amazon dataset (Simhadri et al., 2024). ... Skewed Semi-synthetic Datasets: We also consider the publicly available real-world Arxiv dataset (Embeddings, 2024) which contains Open AI embeddings of around 2 million paper abstracts into 1536 dimensional vectors and the classical SIFT dataset of 1M vectors in 128 dimensions.
Dataset Splits Yes Our first real-world seller dataset comprises of 64-dimensional vector embeddings of different products from a large advertisement corpus. ... There are 20 million base vectors, around 2500 sellers, and 5000 query vectors. ... Amazon Automotive Dataset ... There are around 2 million base vectors and around 85000 brands. ... for the Arxiv dataset, ... around 2 million paper abstracts ... SIFT dataset of 1M vectors in 128 dimensions.
Hardware Specification Yes All experiments were run on a Linux Machine with AMD Ryzen Threadripper 3960X 24-Core Processor CPU s @ 2.3GHz with 48 v CPUs and 250 GB RAM.
Software Dependencies No The paper does not provide specific software names with version numbers (e.g., Python, PyTorch, CUDA, or specific library versions) that were used to implement the algorithms or run the experiments. It mentions various graph-based ANN algorithms like HNSW, NGT, and Disk ANN, but not the specific software versions of their implementations or any other dependencies.
Experiment Setup Yes Parameter setup. For all of the above algorithms, we use fairly standard parameters of list-size L = 200 and graph-degree 64 when building the graphs. During search, we vary the list-size L at search time to get varying quality search results and plot the recall@100 vs average query latency. ... In our setting, we additionally enforce that an edge needs to be blocked by edges of m different colors to not be added to the graph, where m is a tuneable parameter.