SAMOA: Scalable Advanced Massive Online Analysis

Authors: Gianmarco De Francisci Morales, Albert Bifet

JMLR 2015

Reproducibility Variable Result LLM Response
Research Type Experimental SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. The following listing shows how to download, build and run SAMOA. # run SAMOA in local mode bin/samoa local target/SAMOA-Local-0.2.0-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (ArffFileStream -f covtypeNorm.arff) -f 100000"
Researcher Affiliation Industry Gianmarco De Francisci Morales EMAIL Albert Bifet EMAIL Yahoo Labs Av. Diagonal 177, 8th floor, 08018, Barcelona, Spain
Pseudocode Yes The following is a code snippet to build a topology that joins two data streams in SAMOA. TopologyBuilder builder = new TopologyBuilder(); Processor sourceOne = new SourceProcessor(); builder.addProcessor(sourceOne); Stream streamOne = builder.createStream(sourceOne); Processor sourceTwo = new SourceProcessor(); builder.addProcessor(sourceTwo); Stream streamTwo = builder.createStream(sourceTwo); Processor join = new JoinProcessor(); builder.addProcessor(join).connectInputShuffle(streamOne).connectInputKey(streamTwo);
Open Source Code Yes SAMOA is written in Java, is open source, and is available at http://samoa-project.net under the Apache Software License version 2.0. The code is hosted on GitHub.
Open Datasets Yes # download the Forest Cover Type data set wget "http://downloads.sourceforge.net/project/moa-datastream/Datasets/Classification/covtypeNorm.arff.zip" unzip "covtypeNorm.arff.zip"
Dataset Splits No The paper uses a streaming model in which data arrives sequentially, and it evaluates with the 'PrequentialEvaluation' task on the 'covtypeNorm.arff' dataset. It does not specify conventional training/validation/test splits (e.g., percentages, sample counts, or predefined files), so no static partitioning of the dataset can be reproduced from the text.
Hardware Specification No The paper discusses running on 'several distributed stream processing engines' and 'workload across several machines' but does not provide specific hardware details like CPU models, GPU types, or memory specifications used for experiments.
Software Dependencies No SAMOA is written in Java. The paper also mentions running on 'Storm, S4, and Samza' and providing 'connectors for MOA', but it gives no version numbers for these software dependencies.
Experiment Setup Yes The run command example includes parameters such as 'PrequentialEvaluation -l classifiers.ensemble.Bagging -s (ArffFileStream -f covtypeNorm.arff) -f 100000'. Additionally, for CluStream, it states: 'The period can be configured via a command line parameter (e.g., every 10 000 examples).'
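
The Dataset Splits and Experiment Setup rows both rest on prequential evaluation, which is why no train/test split is reported. As a rough illustration of that scheme (interleaved test-then-train), the sketch below uses each arriving example first to test the current model and then to update it. It does not use SAMOA's API: the class `PrequentialSketch`, the learner `MajorityClassLearner`, and the method `prequentialAccuracy` are hypothetical names introduced here, and the learner is a deliberately trivial stand-in that ignores features and predicts the most frequent label seen so far.

```java
import java.util.HashMap;
import java.util.Map;

public class PrequentialSketch {

    /** Hypothetical stand-in learner (not a SAMOA class): predicts the most frequent label seen so far. */
    static final class MajorityClassLearner {
        private final Map<Integer, Integer> counts = new HashMap<>();
        private int bestLabel = -1; // -1 means "nothing seen yet", so the first prediction is always wrong
        private int bestCount = 0;

        int predict() {
            return bestLabel;
        }

        void train(int label) {
            int count = counts.merge(label, 1, Integer::sum);
            // Switch the prediction only when a label becomes strictly more frequent.
            if (count > bestCount) {
                bestCount = count;
                bestLabel = label;
            }
        }
    }

    /**
     * Interleaved test-then-train over a stream of labels.
     * Returns the fraction of examples predicted correctly before the model was trained on them.
     */
    public static double prequentialAccuracy(int[] labels) {
        MajorityClassLearner learner = new MajorityClassLearner();
        int correct = 0;
        for (int label : labels) {
            if (learner.predict() == label) { // 1. test on the new example
                correct++;
            }
            learner.train(label);             // 2. then train on the same example
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] stream = {1, 1, 2, 1, 2, 2, 2, 1};
        System.out.println(prequentialAccuracy(stream)); // prints 0.25
    }
}
```

Because every example is scored before the model is trained on it, the whole stream serves as both test and training data, and the reported accuracy depends on the order in which examples arrive.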