A Multilabel Classification Framework for Approximate Nearest Neighbor Search

Authors: Ville Hyvönen, Elias Jääsaari, Teemu Roos

JMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we present empirical results validating the utility of the proposed framework. In particular, we compare the natural classifier to the earlier candidate set selection methods (lookup search and voting search) when combined with different types of unsupervised trees that have been widely used for ANN search. Specifically, we use ensembles of randomized k-d trees (Friedman et al., 1976; Silpa-Anan and Hartley, 2008), random projection (RP) trees (Dasgupta and Freund, 2008; Dasgupta and Sinha, 2015; Hyvönen et al., 2016), and principal component (PCA) trees (Sproull, 1991; Jääsaari et al., 2019) (see Sec. 7.3 and Appendix B for detailed descriptions of these data structures). Another consequence of the multilabel formulation of Sec. 3.1 is that it enables using any established multilabel classifier for ANN search. To demonstrate this concretely, we train a random forest consisting of standard multilabel classification trees (trained under the PAL reduction (Reddi et al., 2019) by using multinomial log-likelihood as a split criterion) and use it as an index structure for ANN search; it turns out that the fully supervised classification trees have an improved performance compared to the earlier unsupervised trees on some but, curiously, not on all data sets. We follow a standard ANN search performance evaluation setting (Aumüller et al., 2019a; Li et al., 2019) by using the corpus as the training set, searching for k = 10 nearest neighbors in Euclidean distance, and measuring performance by evaluating average recall and query time over the test set of 1000 points. We use four benchmark data sets: Fashion (m = 60000, d = 784), GIST (m = 1000000, d = 960), Trevi (m = 101120, d = 4096), and STL-10 (m = 100000, d = 9216).
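The candidate set selection that the excerpt compares can be illustrated with a minimal sketch of voting search over an ensemble of RP trees. This is a hypothetical Python illustration written for this report, not the authors' C++ implementation: each tree routes the query to a leaf, the corpus points in that leaf receive votes, and points whose vote count reaches a threshold τ form the candidate set, which is then re-ranked by exact Euclidean distance.

```python
import numpy as np

def grow_rp_tree(idx, data, depth, rng):
    """Recursively partition corpus indices with random-projection splits."""
    if depth == 0 or len(idx) <= 1:
        return {"leaf": idx}
    # RP-tree style split: project onto a random direction, cut at the median.
    direction = rng.standard_normal(data.shape[1])
    proj = data[idx] @ direction
    median = np.median(proj)
    left, right = idx[proj <= median], idx[proj > median]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    return {"dir": direction, "median": median,
            "left": grow_rp_tree(left, data, depth - 1, rng),
            "right": grow_rp_tree(right, data, depth - 1, rng)}

def route(tree, q):
    """Follow the query down to its leaf cell."""
    while "leaf" not in tree:
        tree = tree["left"] if q @ tree["dir"] <= tree["median"] else tree["right"]
    return tree["leaf"]

def voting_search(trees, corpus, q, tau, k):
    """Voting search: threshold raw vote counts, then re-rank exactly."""
    votes = np.zeros(len(corpus), dtype=int)
    for t in trees:
        votes[route(t, q)] += 1          # each tree votes for its leaf's points
    candidates = np.flatnonzero(votes >= tau)
    if len(candidates) == 0:             # fall back to the full corpus
        candidates = np.arange(len(corpus))
    dists = np.linalg.norm(corpus[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:k]]
```

The threshold τ trades recall against query time: a larger τ keeps only points that many trees agree on, shrinking the candidate set that must be re-ranked exactly.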
Researcher Affiliation Academia Ville Hyvönen (EMAIL), Department of Computer Science, Aalto University; Elias Jääsaari (EMAIL), Machine Learning Department, Carnegie Mellon University; Teemu Roos (EMAIL), Department of Computer Science, University of Helsinki
Pseudocode Yes Algorithm 1 Natural classifier for ANN search using a single partition...Algorithm 2 Natural classifier for ANN search using multiple partitions...Algorithm 3 Grow a random projection (RP) tree (Dasgupta and Freund, 2008; Dasgupta and Sinha, 2015; Hyvönen et al., 2016)...Algorithm 4 Grow a chronological k-d tree (Bentley, 1975)...Algorithm 5 Grow a randomized multiclass classification tree...Algorithm 6 Grow a randomized k-d tree (Silpa-Anan and Hartley, 2008)...Algorithm 7 Grow a randomized PCA tree (Jääsaari et al., 2019)...Algorithm 8 Generate projection vector for a randomized PCA tree (Jääsaari et al., 2019)
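To give a feel for the simplest of the listed tree-growing routines, here is a hypothetical sketch in the spirit of Algorithm 4 (a chronological k-d tree, which splits on coordinate axes in a fixed cyclic order at the median). This is an illustration written for this report, not a transcription of the paper's pseudocode:

```python
import numpy as np

def grow_kd_tree(idx, data, depth, dim=0):
    """Chronological k-d tree: cycle through coordinates, split at the median."""
    if depth == 0 or len(idx) <= 1:
        return {"leaf": idx}
    values = data[idx, dim]
    median = np.median(values)
    left, right = idx[values <= median], idx[values > median]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    nxt = (dim + 1) % data.shape[1]      # next coordinate in cyclic order
    return {"dim": dim, "median": median,
            "left": grow_kd_tree(left, data, depth - 1, nxt),
            "right": grow_kd_tree(right, data, depth - 1, nxt)}

def kd_leaf(tree, q):
    """Route a query point to its leaf cell."""
    while "leaf" not in tree:
        tree = tree["left"] if q[tree["dim"]] <= tree["median"] else tree["right"]
    return tree["leaf"]
```

The randomized variants listed above (Algorithms 6-8) differ mainly in how the split direction is chosen: from a random subset of high-variance coordinates, or from a randomized approximation of the first principal component.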
Open Source Code Yes The code used to produce the experimental results is attached as supplementary material and can also be found at https://github.com/vioshyvo/a-multilabel-classification-framework.
Open Datasets Yes We used four publicly available and commonly used benchmark data sets (Fashion, GIST, STL-10, and Trevi) consisting of raw or preprocessed images. Fashion: https://github.com/zalandoresearch/fashion-mnist; GIST: http://corpus-texmex.irisa.fr; STL-10: https://cs.stanford.edu/~acoates/stl10; Trevi: http://phototour.cs.washington.edu/patches
Dataset Splits Yes We randomly divided the original data sets into the corpus, the validation set (n_validation = 1000), and the test set (n_test = 1000). The corpus {c_i}_{i=1}^m was used as a training set.
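The split described above is a plain random partition of the rows, which can be sketched as follows (a hypothetical illustration; the seed and function name are assumptions made for this report, not taken from the paper's code):

```python
import numpy as np

def split_dataset(data, n_validation=1000, n_test=1000, seed=0):
    """Randomly partition rows into corpus / validation / test sets.

    The corpus (everything not held out) doubles as the training set,
    mirroring the split described in the excerpt above.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(data))
    test = perm[:n_test]
    validation = perm[n_test:n_test + n_validation]
    corpus = perm[n_test + n_validation:]
    return data[corpus], data[validation], data[test]
```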
Hardware Specification Yes The experiments were run on a machine with two Xeon E5-2680 v4 2.4GHz processors, 256GB RAM, and CentOS 7 as the operating system.
Software Dependencies Yes The algorithms and test code were written in C++14 and compiled using GCC 5.4.0 with the optimization flags -Ofast and -march=native.
Experiment Setup Yes Here we describe the hyperparameter tuning process. According to our initial experiments, the performance of each type of tree was robust w.r.t. its sparsity/randomization parameter (the number a ∈ {1, . . . , d} of coordinate directions, chosen uniformly at random, from which the optimal split direction was chosen for multiclass classification trees; the dimensionality a of the random subspace in which the first principal component was approximated for PCA trees; the expected number a of non-zero components in the random vectors onto which the node points are projected in RP trees; and the number o of the highest-variance directions of the node points from which the split direction was chosen uniformly at random in k-d trees). Therefore, we kept these parameters fixed in the final experiments: for the multiclass classification trees and the PCA trees we used the value a = √d; for the RP trees the value a = 1/√d; and for the k-d trees the value o = 5. Further, we set the learning rate of the iterative PCA algorithm in PCA trees to γ = 0.01 and the maximum number of iterations to t = 20. For the recall levels in the range [0.5, 0.99] considered in the article, the optimal numbers of trees T were generally in the range [5, 200], the optimal depths of the trees were in the range [10, 15], and the optimal values of the threshold parameter τ were in the range [1, 20] for PCA, RP, and k-d trees that use the raw counts as score function values. For multiclass classification trees that use the probability estimates (7) to select the candidate set, the optimal values of the threshold parameter τ were in the range [0.00001, 0.005]. We observed that using a value k′ > k to learn the trees sometimes improved performance. We tested values k′ ∈ {10, 50, 100} for learning the trees, with k′ = 10 or k′ = 50 usually being the optimal parameter value when k = 10.
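The evaluation metric used throughout the setup, average recall of the k = 10 true nearest neighbors over the test queries, can be computed as in this minimal sketch (a hypothetical illustration; `approx_search` stands in for whatever candidate-set method is being evaluated):

```python
import numpy as np

def exact_knn(corpus, q, k=10):
    """Brute-force k nearest neighbors in Euclidean distance (ground truth)."""
    dists = np.linalg.norm(corpus - q, axis=1)
    return np.argsort(dists)[:k]

def average_recall(corpus, queries, approx_search, k=10):
    """Average over queries of |approximate ∩ true k-NN| / k."""
    recalls = []
    for q in queries:
        truth = set(exact_knn(corpus, q, k).tolist())
        found = set(int(i) for i in approx_search(q))
        recalls.append(len(truth & found) / k)
    return float(np.mean(recalls))
```

In the benchmark protocol cited above, this recall is reported together with query time, so each hyperparameter setting traces out one point on a recall/query-time trade-off curve.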