Apache Mahout: Machine Learning on Distributed Dataflow Systems
Authors: Robin Anil, Gokhan Capan, Isabel Drost-Fromm, Ted Dunning, Ellen Friedman, Trevor Grant, Shannon Quinn, Paritosh Ranjan, Sebastian Schelter, Özgür Yılmazel
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1c illustrates the benefits of these optimizations for solving a large regression problem (Schelter et al., 2016), where the automatic rewrites and specialized operators provide a significant speedup compared to execution without optimizations. |
| Researcher Affiliation | Collaboration | Robin Anil (Tock, Chicago, US); Gokhan Capan (Persona.tech & Bogazici University, Istanbul, Turkey); Isabel Drost-Fromm (Europace AG, Berlin, Germany); Ted Dunning and Ellen Friedman (Hewlett-Packard Enterprise, Mountain View, US); Trevor Grant (IBM, Chicago, US); Shannon Quinn (University of Georgia, Athens, US); Paritosh Ranjan (IBM, Kolkata, IN); Sebastian Schelter (University of Amsterdam, Amsterdam, NL); Özgür Yılmazel (Anadolu University, Tepebasi / Eskisehir, Turkey) |
| Pseudocode | Yes | `def dridge(drmX: DrmLike[Int], drmY: DrmLike[Int], lambda: Double): Matrix = { val XtX = (drmX.t %*% drmX).collect; val XtY = (drmX.t %*% drmY).collect; solve(XtX, XtY) }` |
| Open Source Code | Yes | Mahout is maintained as a community-driven open source project at the Apache Software Foundation, and is available at https://mahout.apache.org. |
| Open Datasets | No | The paper discusses the capabilities of the Apache Mahout library and its algorithms. While it mentions the general use of data and discusses how ML algorithms operate on data (e.g., "large text corpora"), it does not specify any particular dataset used for experiments in this paper or provide access information for any such dataset. |
| Dataset Splits | No | The paper describes the Apache Mahout library and its architecture, and mentions performance benefits from optimizations. However, it does not describe specific experiments on datasets with details regarding training, validation, or test splits. No dataset is concretely specified with access information, thus no splits are mentioned. |
| Hardware Specification | No | The paper mentions general hardware terms like "modern hardware like GPUs" and a "current effort is underway to support the native execution of costly matrix operations on GPUs via an integration of the Vienna CL (Rupp et al. (2010)) framework" which implies future or ongoing work. However, it does not provide specific details (e.g., model numbers, memory, or processor types) of the hardware used for any reported results, such as the optimization benefits shown in Figure 1c. |
| Software Dependencies | Yes | The latest version v0.14 requires at least Java 8 and Scala 2.11 for Samsara. The legacy algorithms require Hadoop 2.4, while Samsara programs can be executed on Spark 2.x and Flink 1.1. |
| Experiment Setup | No | The paper focuses on the architecture and evolution of the Apache Mahout library, highlighting its capabilities and optimization benefits. While Figure 1c illustrates optimization benefits for a regression problem, no specific experimental setup details, hyperparameters, or training configurations for any machine learning task are provided in the main text of the paper. |
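The `dridge` pseudocode quoted in the table computes ridge regression via the normal equations: it materializes the small Gram matrix XᵀX and the vector XᵀY on the driver and solves the resulting local system. Below is a minimal plain-Python sketch of that local computation, for illustration only (it is not Mahout's implementation). Note that the paper's snippet accepts a `lambda` parameter but elides the regularization step; adding `lambda` to the diagonal of XᵀX is the standard ridge term, and including it here is our assumption.

```python
def solve(a, b):
    """Solve the linear system a @ x = b via Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))  # partial pivoting
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):  # eliminate entries below the pivot
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ridge(X, y, lam):
    """Ridge regression via the normal equations: (X^T X + lam * I) beta = X^T y."""
    Xt = [list(col) for col in zip(*X)]            # X transposed
    XtX = [[sum(Xt[i][k] * X[k][j] for k in range(len(X)))
            for j in range(len(Xt))] for i in range(len(Xt))]
    for i in range(len(XtX)):
        XtX[i][i] += lam                           # ridge term (our assumption, see above)
    Xty = [sum(Xt[i][k] * y[k] for k in range(len(y))) for i in range(len(Xt))]
    return solve(XtX, Xty)
```

In the distributed Samsara version, only the two transpose-times-self products run on the cluster; the `solve` on the tiny (features × features) system is cheap and stays local, which is why collecting to the driver is safe there.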