A Multilabel Classification Framework for Approximate Nearest Neighbor Search

Authors: Ville Hyvönen, Elias Jääsaari, Teemu Roos

JMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we present empirical results validating the utility of the proposed framework. In particular, we compare the natural classifier to the earlier candidate set selection methods (lookup search and voting search) when combined with different types of unsupervised trees that have been widely used for ANN search. Specifically, we use ensembles of randomized k-d trees (Friedman et al., 1976; Silpa-Anan and Hartley, 2008), random projection (RP) trees (Dasgupta and Freund, 2008; Dasgupta and Sinha, 2015; Hyvönen et al., 2016), and principal component (PCA) trees (Sproull, 1991; Jääsaari et al., 2019) (see Sec. 7.3 and Appendix B for detailed descriptions of these data structures). Another consequence of the multilabel formulation of Sec. 3.1 is that it enables using any established multilabel classifier for ANN search. To demonstrate this concretely, we train a random forest consisting of standard multilabel classification trees (trained under the PAL reduction (Reddi et al., 2019) by using multinomial log-likelihood as a split criterion) and use it as an index structure for ANN search; it turns out that the fully supervised classification trees have an improved performance compared to the earlier unsupervised trees on some but, curiously, not on all data sets. We follow a standard ANN search performance evaluation setting (Aumüller et al., 2019a; Li et al., 2019) by using the corpus as the training set, searching for k = 10 nearest neighbors in Euclidean distance, and measuring performance by evaluating average recall and query time over the test set of 1000 points. We use four benchmark data sets: Fashion (m = 60000, d = 784), GIST (m = 1000000, d = 960), Trevi (m = 101120, d = 4096), and STL-10 (m = 100000, d = 9216).
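The candidate set selection that the excerpt compares can be illustrated with a minimal sketch of voting search over an ensemble of RP trees. This is a hypothetical Python illustration written for this report, not the authors' C++ implementation: each tree routes the query to a leaf, the corpus points in that leaf receive votes, and points whose vote count reaches a threshold τ form the candidate set, which is then re-ranked by exact Euclidean distance.

```python
import numpy as np

def grow_rp_tree(idx, data, depth, rng):
    """Recursively partition corpus indices with random-projection splits."""
    if depth == 0 or len(idx) <= 1:
        return {"leaf": idx}
    # RP-tree style split: project onto a random direction, cut at the median.
    direction = rng.standard_normal(data.shape[1])
    proj = data[idx] @ direction
    median = np.median(proj)
    left, right = idx[proj <= median], idx[proj > median]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    return {"dir": direction, "median": median,
            "left": grow_rp_tree(left, data, depth - 1, rng),
            "right": grow_rp_tree(right, data, depth - 1, rng)}

def route(tree, q):
    """Follow the query down to its leaf cell."""
    while "leaf" not in tree:
        tree = tree["left"] if q @ tree["dir"] <= tree["median"] else tree["right"]
    return tree["leaf"]

def voting_search(trees, corpus, q, tau, k):
    """Voting search: threshold raw vote counts, then re-rank exactly."""
    votes = np.zeros(len(corpus), dtype=int)
    for t in trees:
        votes[route(t, q)] += 1          # each tree votes for its leaf's points
    candidates = np.flatnonzero(votes >= tau)
    if len(candidates) == 0:             # fall back to the full corpus
        candidates = np.arange(len(corpus))
    dists = np.linalg.norm(corpus[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:k]]
```

The threshold τ trades recall against query time: a larger τ keeps only points that many trees agree on, shrinking the candidate set that must be re-ranked exactly.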
Researcher Affiliation Academia Ville Hyvönen (EMAIL), Department of Computer Science, Aalto University; Elias Jääsaari (EMAIL), Machine Learning Department, Carnegie Mellon University; Teemu Roos (EMAIL), Department of Computer Science, University of Helsinki
Pseudocode Yes Algorithm 1 Natural classifier for ANN search using a single partition...Algorithm 2 Natural classifier for ANN search using multiple partitions...Algorithm 3 Grow a random projection (RP) tree (Dasgupta and Freund, 2008; Dasgupta and Sinha, 2015; Hyvönen et al., 2016)...Algorithm 4 Grow a chronological k-d tree (Bentley, 1975)...Algorithm 5 Grow a randomized multiclass classification tree...Algorithm 6 Grow a randomized k-d tree (Silpa-Anan and Hartley, 2008)...Algorithm 7 Grow a randomized PCA tree (Jääsaari et al., 2019)...Algorithm 8 Generate projection vector for a randomized PCA tree (Jääsaari et al., 2019)
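To give a feel for the simplest of the listed tree-growing routines, here is a hypothetical sketch in the spirit of Algorithm 4 (a chronological k-d tree, which splits on coordinate axes in a fixed cyclic order at the median). This is an illustration written for this report, not a transcription of the paper's pseudocode:

```python
import numpy as np

def grow_kd_tree(idx, data, depth, dim=0):
    """Chronological k-d tree: cycle through coordinates, split at the median."""
    if depth == 0 or len(idx) <= 1:
        return {"leaf": idx}
    values = data[idx, dim]
    median = np.median(values)
    left, right = idx[values <= median], idx[values > median]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": idx}
    nxt = (dim + 1) % data.shape[1]      # next coordinate in cyclic order
    return {"dim": dim, "median": median,
            "left": grow_kd_tree(left, data, depth - 1, nxt),
            "right": grow_kd_tree(right, data, depth - 1, nxt)}

def kd_leaf(tree, q):
    """Route a query point to its leaf cell."""
    while "leaf" not in tree:
        tree = tree["left"] if q[tree["dim"]] <= tree["median"] else tree["right"]
    return tree["leaf"]
```

The randomized variants listed above (Algorithms 6-8) differ mainly in how the split direction is chosen: from a random subset of high-variance coordinates, or from a randomized approximation of the first principal component.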
Open Source Code Yes The code used to produce the experimental results is attached as supplementary material and can also be found at https://github.com/vioshyvo/a-multilabel-classification-framework.
Open Datasets Yes We used four publicly available and commonly used benchmark data sets (Fashion, GIST, STL-10, and Trevi) consisting of raw or preprocessed images. Fashion: https://github.com/zalandoresearch/fashion-mnist; GIST: http://corpus-texmex.irisa.fr; STL-10: https://cs.stanford.edu/~acoates/stl10; Trevi: http://phototour.cs.washington.edu/patches
Dataset Splits Yes We randomly divided the original data sets into the corpus, the validation set (n_validation = 1000), and the test set (n_test = 1000). The corpus {c_i}_{i=1}^m was used as a training set.
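The split described above is a plain random partition of the rows, which can be sketched as follows (a hypothetical illustration; the seed and function name are assumptions made for this report, not taken from the paper's code):

```python
import numpy as np

def split_dataset(data, n_validation=1000, n_test=1000, seed=0):
    """Randomly partition rows into corpus / validation / test sets.

    The corpus (everything not held out) doubles as the training set,
    mirroring the split described in the excerpt above.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(data))
    test = perm[:n_test]
    validation = perm[n_test:n_test + n_validation]
    corpus = perm[n_test + n_validation:]
    return data[corpus], data[validation], data[test]
```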
Hardware Specification Yes The experiments were run on a machine with two Xeon E5-2680 v4 2.4GHz processors, 256GB RAM, and CentOS 7 as the operating system.
Software Dependencies Yes The algorithms and test code were written in C++14 and compiled using GCC 5.4.0 with the optimization flags -Ofast and -march=native.
Experiment Setup Yes Here we describe the hyperparameter tuning process. According to our initial experiments, the performance of each type of tree was robust w.r.t. its sparsity/randomization parameter (the number a ∈ {1, . . . , d} of coordinate directions, chosen uniformly at random, from which the optimal split direction was chosen for multiclass classification trees; the dimensionality a of the random subspace in which the first principal component was approximated for PCA trees; the expected number a of non-zero components in the random vectors onto which the node points are projected in RP trees; and the number o of the highest-variance directions of the node points from which the split direction was chosen uniformly at random in k-d trees). Therefore, we kept these parameters fixed in the final experiments: for the multiclass classification trees and the PCA trees we used the value a = √d; for the RP trees the value a = 1/√d; and for the k-d trees the value o = 5. Further, we set the learning rate of the iterative PCA algorithm in PCA trees to γ = 0.01 and the maximum number of iterations to t = 20. For the recall levels in the range [0.5, 0.99] considered in the article, the optimal numbers of trees T were generally in the range [5, 200], the optimal depths of the trees were in the range [10, 15], and the optimal values of the threshold parameter τ were in the range [1, 20] for PCA, RP, and k-d trees that use the raw counts as score function values. For multiclass classification trees that use the probability estimates (7) to select the candidate set, the optimal values of the threshold parameter τ were in the range [0.00001, 0.005]. We observed that using a value k′ > k to learn the trees sometimes improved performance. We tested values k′ ∈ {10, 50, 100} for learning the trees, with k′ = 10 or k′ = 50 usually being the optimal parameter value when k = 10.
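The evaluation metric used throughout the setup, average recall of the k = 10 true nearest neighbors over the test queries, can be computed as in this minimal sketch (a hypothetical illustration; `approx_search` stands in for whatever candidate-set method is being evaluated):

```python
import numpy as np

def exact_knn(corpus, q, k=10):
    """Brute-force k nearest neighbors in Euclidean distance (ground truth)."""
    dists = np.linalg.norm(corpus - q, axis=1)
    return np.argsort(dists)[:k]

def average_recall(corpus, queries, approx_search, k=10):
    """Average over queries of |approximate ∩ true k-NN| / k."""
    recalls = []
    for q in queries:
        truth = set(exact_knn(corpus, q, k).tolist())
        found = set(int(i) for i in approx_search(q))
        recalls.append(len(truth & found) / k)
    return float(np.mean(recalls))
```

In the benchmark protocol cited above, this recall is reported together with query time, so each hyperparameter setting traces out one point on a recall/query-time trade-off curve.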