Apache Mahout: Machine Learning on Distributed Dataflow Systems
Authors: Robin Anil, Gokhan Capan, Isabel Drost-Fromm, Ted Dunning, Ellen Friedman, Trevor Grant, Shannon Quinn, Paritosh Ranjan, Sebastian Schelter, Özgür Yılmazel
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1c illustrates the benefits of these optimizations for solving a large regression problem (Schelter et al., 2016), where the automatic rewrites and specialized operators provide a significant speedup compared to execution without optimizations. |
| Researcher Affiliation | Collaboration | Robin Anil (Tock, Chicago, US); Gokhan Capan (Persona.tech & Bogazici University, Istanbul, Turkey); Isabel Drost-Fromm (Europace AG, Berlin, Germany); Ted Dunning and Ellen Friedman (Hewlett-Packard Enterprise, Mountain View, US); Trevor Grant (IBM, Chicago, US); Shannon Quinn (University of Georgia, Athens, US); Paritosh Ranjan (IBM, Kolkata, IN); Sebastian Schelter (University of Amsterdam, Amsterdam, NL); Özgür Yılmazel (Anadolu University, Tepebasi / Eskisehir, Turkey) |
| Pseudocode | Yes | `def dridge(drmX: DrmLike[Int], drmY: DrmLike[Int], lambda: Double): Matrix = { val XtX = (drmX.t %*% drmX).collect; val XtY = (drmX.t %*% drmY).collect; solve(XtX, XtY) }` |
| Open Source Code | Yes | Mahout is maintained as a community-driven open source project at the Apache Software Foundation, and is available at https://mahout.apache.org. |
| Open Datasets | No | The paper discusses the capabilities of the Apache Mahout library and its algorithms. While it mentions the general use of data and discusses how ML algorithms operate on data (e.g., "large text corpora"), it does not specify any particular dataset used for experiments in this paper or provide access information for any such dataset. |
| Dataset Splits | No | The paper describes the Apache Mahout library and its architecture, and mentions performance benefits from optimizations. However, it does not describe specific experiments on datasets with details regarding training, validation, or test splits. No dataset is concretely specified with access information, thus no splits are mentioned. |
| Hardware Specification | No | The paper mentions general hardware terms like "modern hardware like GPUs" and a "current effort is underway to support the native execution of costly matrix operations on GPUs via an integration of the Vienna CL (Rupp et al. (2010)) framework" which implies future or ongoing work. However, it does not provide specific details (e.g., model numbers, memory, or processor types) of the hardware used for any reported results, such as the optimization benefits shown in Figure 1c. |
| Software Dependencies | Yes | The latest version v0.14 requires at least Java 8 and Scala 2.11 for Samsara. The legacy algorithms require Hadoop 2.4, while Samsara programs can be executed on Spark 2.x and Flink 1.1. |
| Experiment Setup | No | The paper focuses on the architecture and evolution of the Apache Mahout library, highlighting its capabilities and optimization benefits. While Figure 1c illustrates optimization benefits for a regression problem, no specific experimental setup details, hyperparameters, or training configurations for any machine learning task are provided in the main text of the paper. |
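The `dridge` pseudocode quoted in the table computes ridge regression via the normal equations: it materializes the small Gram matrix XᵀX and the vector XᵀY on the driver and solves the resulting local system. Below is a minimal plain-Python sketch of that local computation, for illustration only (it is not Mahout's implementation). Note that the paper's snippet accepts a `lambda` parameter but elides the regularization step; adding `lambda` to the diagonal of XᵀX is the standard ridge term, and including it here is our assumption.

```python
def solve(a, b):
    """Solve the linear system a @ x = b via Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))  # partial pivoting
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):  # eliminate entries below the pivot
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ridge(X, y, lam):
    """Ridge regression via the normal equations: (X^T X + lam * I) beta = X^T y."""
    Xt = [list(col) for col in zip(*X)]            # X transposed
    XtX = [[sum(Xt[i][k] * X[k][j] for k in range(len(X)))
            for j in range(len(Xt))] for i in range(len(Xt))]
    for i in range(len(XtX)):
        XtX[i][i] += lam                           # ridge term (our assumption, see above)
    Xty = [sum(Xt[i][k] * y[k] for k in range(len(y))) for i in range(len(Xt))]
    return solve(XtX, Xty)
```

In the distributed Samsara version, only the two transpose-times-self products run on the cluster; the `solve` on the tiny (features × features) system is cheap and stays local, which is why collecting to the driver is safe there.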