Distributed Submodular Maximization

Authors: Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, Andreas Krause

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar based clustering on tens of millions of examples using Hadoop." "We extensively evaluate our algorithm on several machine learning problems, including exemplar based clustering, active set selection and finding cuts in graphs, and show that our approach leads to parallel solutions that are very competitive with those obtained via centralized methods (98% in exemplar based clustering, 97% in active set selection, 90% in finding cuts)."
Researcher Affiliation | Academia | Baharan Mirzasoleiman (EMAIL), Department of Computer Science, ETH Zurich, Universitaetstrasse 6, 8092 Zurich, Switzerland; Amin Karbasi (EMAIL), School of Engineering and Applied Science, Yale University, New Haven, USA; Rik Sarkar (EMAIL), Department of Informatics, University of Edinburgh, 10 Crichton St, Edinburgh EH8 9AB, United Kingdom; Andreas Krause (EMAIL), Department of Computer Science, ETH Zurich, Universitaetstrasse 6, 8092 Zurich, Switzerland
Pseudocode | Yes | Algorithm 1: Inefficient Distributed Submodular Maximization; Algorithm 2: Greedy Distributed Submodular Maximization (GreeDi); Algorithm 3: GreeDi under General Constraints
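The two-round scheme of Algorithm 2 can be sketched as follows. This is not the paper's code (none was released); the function names, the coverage objective in the usage note, and the sequential stand-in for the map/reduce rounds are all illustrative assumptions.

```python
def greedy(f, ground, k):
    """Standard greedy: repeatedly add the element with the largest
    marginal gain f(S + [e]) - f(S), up to k elements."""
    S = []
    candidates = set(ground)
    for _ in range(min(k, len(candidates))):
        best = max(candidates, key=lambda e: f(S + [e]) - f(S))
        S.append(best)
        candidates.remove(best)
    return S

def greedi(f, ground, k, m):
    """Two-round distributed greedy (GreeDi, Algorithm 2), run
    sequentially here. `f` is a monotone submodular set function
    given as f(list) -> float. (The paper's experiments also let
    each machine select alpha*k elements; alpha = 1 here.)"""
    ground = list(ground)
    # Round 1: partition the ground set across m "machines",
    # each of which runs greedy on its own partition.
    parts = [ground[i::m] for i in range(m)]
    local_solutions = [greedy(f, part, k) for part in parts]
    # Round 2: merge the m local solutions and run greedy on the
    # union (at most k*m elements) to pick the final k elements.
    merged = [e for sol in local_solutions for e in sol]
    return greedy(f, merged, k)
```

A small usage example with a set-cover objective (maximize the size of the union of picked sets) mirrors the submodular coverage problem quoted in the Open Datasets row.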
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | "We performed our experiments on a set of 10,000 Tiny Images (Torralba et al., 2008)." "Our second large scale experiment consists of 45,811,883 user visits from the Featured Tab of the Today Module on Yahoo! Front Page (web, 2012)." "We used the Parkinsons Telemonitoring dataset (Tsanas et al., 2010)." "We applied GreeDi to the submodular coverage problem in which given a collection V of sets, we would like to pick at most k sets from V in order to maximize the size of their union. We compared the performance of our GreeDi algorithm to the reported performance of Greedy Scaling on the same datasets, namely Accidents (Geurts et al., 2003) and Kosarak (Bodon, 2012)."
Dataset Splits | No | The paper describes partitioning data for distributed processing (e.g., "partitioned the images uniformly at random to reducers") but does not provide specific details on train/test/validation splits for model evaluation.
Hardware Specification | No | Only cluster size and core counts are reported, not full machine specifications: "Our experimental infrastructure was a cluster of 10 quad-core machines running Hadoop with the number of reducers set to m = 8000." "Our experimental setup was a cluster of 8 quad-core machines running Spark with the number of reducers set to m = 32."
Software Dependencies | No | Hadoop and Spark are named, but no versions or library dependencies are specified: "Our experimental infrastructure was a cluster of 10 quad-core machines running Hadoop..." "Our experimental setup was a cluster of 8 quad-core machines running Spark..."
Experiment Setup | Yes | "For GreeDi, we let each of the m machines select a set of size αk, and select a final solution of size k among the union of the m solutions (i.e., among αkm elements)." "...with the number of exemplars set to k = 50, and varying number of partitions m." "Our active set selection experiment involves GreeDi applied to the information gain f(S) (see Sec. 3.4) with Gaussian kernel, h = 0.75 and σ = 1." "Each reducer separately performed the lazy greedy algorithm on its own set of 10,000 images...to extract 64 images with the highest marginal gains... We then merged the results and performed another round of lazy greedy selection on the merged results to extract the final 64 exemplars." "Each reducer performed the lazy greedy algorithm on its own set of 1,431,621 vectors...in order to extract 256 elements... The maximum running time per reduce task was 12 minutes for selecting 128 elements and 48 minutes for selecting 256 elements."
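The per-reducer selection quoted above is the lazy greedy algorithm (Minoux's accelerated greedy). A minimal sketch, assuming a generic monotone submodular objective `f` given as a set function over lists; the function name and heap-based structure are illustrative, not taken from the paper:

```python
import heapq

def lazy_greedy(f, ground, k):
    """Lazy greedy: cache marginal gains in a max-heap (negated for
    heapq's min-heap) and re-evaluate only the top candidate. By
    submodularity, stale gains are upper bounds on true gains, so an
    element whose refreshed gain still beats the next-best stale gain
    can be accepted without touching the rest of the heap."""
    S = []
    fS = f(S)
    # Initialize with singleton gains.
    heap = [(-(f([e]) - fS), e) for e in ground]
    heapq.heapify(heap)
    while heap and len(S) < k:
        _, e = heapq.heappop(heap)
        # Refresh this element's gain with respect to the current S.
        gain = f(S + [e]) - fS
        if not heap or gain >= -heap[0][0]:
            # Still at least as good as the next-best stale gain: take it.
            S.append(e)
            fS += gain
        else:
            # Gain shrank; push back with its updated value.
            heapq.heappush(heap, (-gain, e))
    return S
```

Lazy greedy returns the same solution quality as standard greedy while typically recomputing far fewer marginal gains, which is what makes the per-reducer runs on millions of elements feasible.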