Distributed Submodular Maximization

Authors: Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, Andreas Krause

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar based clustering on tens of millions of examples using Hadoop." "We extensively evaluate our algorithm on several machine learning problems, including exemplar based clustering, active set selection and finding cuts in graphs, and show that our approach leads to parallel solutions that are very competitive with those obtained via centralized methods (98% in exemplar based clustering, 97% in active set selection, 90% in finding cuts)."
Researcher Affiliation | Academia | Baharan Mirzasoleiman (EMAIL), Department of Computer Science, ETH Zurich, Universitaetstrasse 6, 8092 Zurich, Switzerland; Amin Karbasi (EMAIL), School of Engineering and Applied Science, Yale University, New Haven, USA; Rik Sarkar (EMAIL), Department of Informatics, University of Edinburgh, 10 Crichton St, Edinburgh EH8 9AB, United Kingdom; Andreas Krause (EMAIL), Department of Computer Science, ETH Zurich, Universitaetstrasse 6, 8092 Zurich, Switzerland
Pseudocode | Yes | Algorithm 1: Inefficient Distributed Submodular Maximization; Algorithm 2: Greedy Distributed Submodular Maximization (GreeDi); Algorithm 3: GreeDi under General Constraints
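The two-round scheme of Algorithm 2 can be sketched as follows. This is not the paper's code (none was released); the function names, the coverage objective in the usage note, and the sequential stand-in for the map/reduce rounds are all illustrative assumptions.

```python
def greedy(f, ground, k):
    """Standard greedy: repeatedly add the element with the largest
    marginal gain f(S + [e]) - f(S), up to k elements."""
    S = []
    candidates = set(ground)
    for _ in range(min(k, len(candidates))):
        best = max(candidates, key=lambda e: f(S + [e]) - f(S))
        S.append(best)
        candidates.remove(best)
    return S

def greedi(f, ground, k, m):
    """Two-round distributed greedy (GreeDi, Algorithm 2), run
    sequentially here. `f` is a monotone submodular set function
    given as f(list) -> float. (The paper's experiments also let
    each machine select alpha*k elements; alpha = 1 here.)"""
    ground = list(ground)
    # Round 1: partition the ground set across m "machines",
    # each of which runs greedy on its own partition.
    parts = [ground[i::m] for i in range(m)]
    local_solutions = [greedy(f, part, k) for part in parts]
    # Round 2: merge the m local solutions and run greedy on the
    # union (at most k*m elements) to pick the final k elements.
    merged = [e for sol in local_solutions for e in sol]
    return greedy(f, merged, k)
```

A small usage example with a set-cover objective (maximize the size of the union of picked sets) mirrors the submodular coverage problem quoted in the Open Datasets row.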
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | "We performed our experiments on a set of 10,000 Tiny Images (Torralba et al., 2008)." "Our second large scale experiment consists of 45,811,883 user visits from the Featured Tab of the Today Module on Yahoo! Front Page (web, 2012)." "We used the Parkinsons Telemonitoring dataset (Tsanas et al., 2010)." "We applied GreeDi to the submodular coverage problem in which given a collection V of sets, we would like to pick at most k sets from V in order to maximize the size of their union. We compared the performance of our GreeDi algorithm to the reported performance of Greedy Scaling on the same datasets, namely Accidents (Geurts et al., 2003) and Kosarak (Bodon, 2012)."
Dataset Splits | No | The paper describes partitioning data for distributed processing (e.g., "partitioned the images uniformly at random to reducers") but does not provide specific details on train/test/validation splits for model evaluation.
Hardware Specification | No | Only cluster size and core counts are reported, not full machine specifications: "Our experimental infrastructure was a cluster of 10 quad-core machines running Hadoop with the number of reducers set to m = 8000." "Our experimental setup was a cluster of 8 quad-core machines running Spark with the number of reducers set to m = 32."
Software Dependencies | No | Hadoop and Spark are named, but no versions or library dependencies are specified: "Our experimental infrastructure was a cluster of 10 quad-core machines running Hadoop..." "Our experimental setup was a cluster of 8 quad-core machines running Spark..."
Experiment Setup | Yes | "For GreeDi, we let each of the m machines select a set of size αk, and select a final solution of size k among the union of the m solutions (i.e., among αkm elements)." "...with the number of exemplars set to k = 50, and varying number of partitions m." "Our active set selection experiment involves GreeDi applied to the information gain f(S) (see Sec. 3.4) with Gaussian kernel, h = 0.75 and σ = 1." "Each reducer separately performed the lazy greedy algorithm on its own set of 10,000 images...to extract 64 images with the highest marginal gains... We then merged the results and performed another round of lazy greedy selection on the merged results to extract the final 64 exemplars." "Each reducer performed the lazy greedy algorithm on its own set of 1,431,621 vectors...in order to extract 256 elements... The maximum running time per reduce task was 12 minutes for selecting 128 elements and 48 minutes for selecting 256 elements."
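The per-reducer selection quoted above is the lazy greedy algorithm (Minoux's accelerated greedy). A minimal sketch, assuming a generic monotone submodular objective `f` given as a set function over lists; the function name and heap-based structure are illustrative, not taken from the paper:

```python
import heapq

def lazy_greedy(f, ground, k):
    """Lazy greedy: cache marginal gains in a max-heap (negated for
    heapq's min-heap) and re-evaluate only the top candidate. By
    submodularity, stale gains are upper bounds on true gains, so an
    element whose refreshed gain still beats the next-best stale gain
    can be accepted without touching the rest of the heap."""
    S = []
    fS = f(S)
    # Initialize with singleton gains.
    heap = [(-(f([e]) - fS), e) for e in ground]
    heapq.heapify(heap)
    while heap and len(S) < k:
        _, e = heapq.heappop(heap)
        # Refresh this element's gain with respect to the current S.
        gain = f(S + [e]) - fS
        if not heap or gain >= -heap[0][0]:
            # Still at least as good as the next-best stale gain: take it.
            S.append(e)
            fS += gain
        else:
            # Gain shrank; push back with its updated value.
            heapq.heappush(heap, (-gain, e))
    return S
```

Lazy greedy returns the same solution quality as standard greedy while typically recomputing far fewer marginal gains, which is what makes the per-reducer runs on millions of elements feasible.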