MLlib: Machine Learning in Apache Spark

Authors: Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "In this section we briefly demonstrate the speed, scalability, and continued improvements in MLlib over time. We first look at scalability by considering ALS, a commonly used collaborative filtering approach. For this benchmark, we worked with scaled copies of the Amazon Reviews dataset (McAuley and Leskovec, 2013), where we duplicated user information as necessary to increase the size of the data. We ran 5 iterations of MLlib's ALS for various scaled copies of the dataset, running on a 16-node EC2 cluster with m3.2xlarge instances using MLlib versions 1.1 and 1.4. For comparison purposes, we ran the same experiment using Apache Mahout version 0.9 (Mahout, 2014), which runs on Hadoop MapReduce. Benchmarking results, presented in Figure 2(a), demonstrate that MapReduce's scheduling overhead and lack of support for iterative computation substantially slow down its performance on moderately sized datasets. In contrast, MLlib exhibits excellent performance and scalability, and in fact can scale to much larger problems. Next, we compare MLlib versions 1.0 and 1.1 to evaluate improvement over time. We measure the performance of common machine learning methods in MLlib, with all experiments performed on EC2 using m3.2xlarge instances with 16 worker nodes and synthetic datasets from the spark-perf package (https://github.com/databricks/spark-perf). The results are presented in Figure 2(b), and show a 3× speedup on average across all algorithms."
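To make the benchmarked algorithm concrete, here is a minimal single-machine sketch of alternating least squares for collaborative filtering. This is not MLlib's distributed implementation: the function name, hyperparameters, and the tiny synthetic ratings matrix are our own illustrative assumptions; only the 5-iteration count mirrors the paper's benchmark setup.

```python
import numpy as np

def als(R, rank=2, n_iters=5, reg=0.1, seed=0):
    """Factor R (users x items) as U @ V.T by alternating least squares."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)  # ridge term keeps each solve well-posed
    for _ in range(n_iters):  # the paper's benchmark runs 5 ALS iterations
        # Solve for user factors with item factors held fixed...
        U = np.linalg.solve(V.T @ V + eye, V.T @ R.T).T
        # ...then for item factors with user factors held fixed.
        V = np.linalg.solve(U.T @ U + eye, U.T @ R).T
    return U, V

# Tiny dense example: 4 users rating 3 items (values are made up).
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])
U, V = als(R, rank=2, n_iters=5)
err = np.abs(R - U @ V.T).mean()
print(f"mean absolute reconstruction error: {err:.3f}")
```

MLlib's real implementation additionally blocks the user and item factors across the cluster to reduce communication and JVM garbage-collection overhead; the sketch only captures the alternating-solve structure that each iteration performs.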
Researcher Affiliation: Collaboration.
Xiangrui Meng (EMAIL), Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105
Joseph Bradley (EMAIL), Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105
Burak Yavuz (EMAIL), Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105
Evan Sparks (EMAIL), UC Berkeley, 465 Soda Hall, Berkeley, CA 94720
Shivaram Venkataraman (EMAIL), UC Berkeley, 465 Soda Hall, Berkeley, CA 94720
Davies Liu (EMAIL), Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105
Jeremy Freeman (EMAIL), HHMI Janelia Research Campus, 19805 Helix Dr, Ashburn, VA 20147
DB Tsai (EMAIL), Netflix, 970 University Ave, Los Gatos, CA 95032
Manish Amde (EMAIL), Origami Logic, 1134 Crane Street, Menlo Park, CA 94025
Sean Owen (EMAIL), Cloudera UK, 33 Creechurch Lane, London EC3A 5EB, United Kingdom
Doris Xin (EMAIL), UIUC, 201 N Goodwin Ave, Urbana, IL 61801
Reynold Xin (EMAIL), Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105
Michael J. Franklin (EMAIL), UC Berkeley, 465 Soda Hall, Berkeley, CA 94720
Reza Zadeh (EMAIL), Stanford and Databricks, 475 Via Ortega, Stanford, CA 94305
Matei Zaharia (EMAIL), MIT and Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105
Ameet Talwalkar (EMAIL), UCLA and Databricks, 4732 Boelter Hall, Los Angeles, CA 90095
Pseudocode: No. The paper does not contain structured pseudocode or algorithm blocks. It describes the functionality and algorithmic optimizations in paragraph text, for example: "MLlib includes many optimizations to support efficient distributed learning and prediction. We highlight a few cases here. The ALS algorithm for recommendation makes careful use of blocking to reduce JVM garbage collection overhead and to leverage higher-level linear algebra operations. Decision trees use many ideas from the PLANET project (Panda et al., 2009), such as data-dependent feature discretization to reduce communication costs, and tree ensembles parallelize learning both within trees and across trees. Generalized linear models are learned via optimization algorithms which parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations."
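The parallelized gradient computation the excerpt describes for generalized linear models follows a map-reduce pattern: each worker computes a partial gradient over its data partition, and the partials are summed. The sketch below is an illustrative assumption, not MLlib's code; the logistic loss, the function names, and the choice of 4 partitions are ours, and real Spark would aggregate over RDD partitions rather than array chunks.

```python
import numpy as np

def partial_gradient(w, X_part, y_part):
    """Logistic-loss gradient contribution of one data partition ("map")."""
    p = 1.0 / (1.0 + np.exp(-X_part @ w))
    return X_part.T @ (p - y_part)

def distributed_gradient(w, partitions):
    # "Reduce" step: sum the per-partition gradients into the full gradient.
    return sum(partial_gradient(w, Xp, yp) for Xp, yp in partitions)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

# Split the dataset into 4 chunks to mimic 4 workers' partitions.
partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

w = np.zeros(3)
full = partial_gradient(w, X, y)            # single-machine gradient
dist = distributed_gradient(w, partitions)  # map-reduce gradient
print(np.allclose(full, dist))
```

Because the gradient is a sum over examples, the partitioned computation matches the single-machine result up to floating-point rounding, which is what makes the data-parallel scheme exact rather than approximate.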
Open Source Code: Yes. "Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library."
Open Datasets: Yes. "For this benchmark, we worked with scaled copies of the Amazon Reviews dataset (McAuley and Leskovec, 2013)... synthetic datasets from the spark-perf package (https://github.com/databricks/spark-perf)."
Dataset Splits: No. The paper mentions using "scaled copies of the Amazon Reviews dataset" and "synthetic datasets from the spark-perf package," but it does not provide specific details on how these datasets were split into training, validation, or testing sets. For example, it states "we duplicated user information as necessary to increase the size of the data" for Amazon Reviews, but this does not describe a dataset split.
Hardware Specification: Yes. "running on a 16 node EC2 cluster with m3.2xlarge instances"
Software Dependencies: No. "The user guide also lists MLlib's code dependencies, which as of version 1.4 are the following open-source libraries: Breeze, netlib-java, and (in Python) NumPy (Breeze, 2015; Halliday, 2015; Braun, 2015; NumPy, 2015)." These dependencies are listed without specific version numbers, which are required for reproducibility.
Experiment Setup: Yes. "We ran 5 iterations of MLlib's ALS for various scaled copies of the dataset, running on a 16 node EC2 cluster with m3.2xlarge instances using MLlib versions 1.1 and 1.4... We measure the performance of common machine learning methods in MLlib, with all experiments performed on EC2 using m3.2xlarge instances with 16 worker nodes and synthetic datasets from the spark-perf package (https://github.com/databricks/spark-perf)."