reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SAMOA: Scalable Advanced Massive Online Analysis

Authors: Gianmarco De Francisci Morales, Albert Bifet

JMLR 2015

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. The following listing shows how to download, build and run SAMOA. # run SAMOA in local mode bin/samoa local target/SAMOA-Local-0.2.0-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (ArffFileStream -f covtypeNorm.arff) -f 100000"
Researcher Affiliation	Industry	Gianmarco De Francisci Morales EMAIL Albert Bifet EMAIL Yahoo Labs Av. Diagonal 177, 8th floor, 08018, Barcelona, Spain
Pseudocode	Yes	The following is a code snippet to build a topology that joins two data streams in SAMOA. TopologyBuilder builder = new TopologyBuilder(); Processor sourceOne = new SourceProcessor(); builder.addProcessor(sourceOne); Stream streamOne = builder.createStream(sourceOne); Processor sourceTwo = new SourceProcessor(); builder.addProcessor(sourceTwo); Stream streamTwo = builder.createStream(sourceTwo); Processor join = new JoinProcessor(); builder.addProcessor(join).connectInputShuffle(streamOne).connectInputKey(streamTwo);
Open Source Code	Yes	SAMOA is written in Java, is open source, and is available at http://samoa-project.net under the Apache Software License version 2.0. The code is hosted on GitHub.
Open Datasets	Yes	# download the Forest Cover Type data set wget "http://downloads.sourceforge.net/project/moa-datastream/Datasets/Classification/covtypeNorm.arff.zip" unzip "covtypeNorm.arff.zip"
Dataset Splits	No	The paper uses a streaming model where data arrives sequentially, and evaluation is 'PrequentialEvaluation' on the 'covtypeNorm.arff' dataset. However, it does not specify traditional training/test/validation splits (e.g., percentages, sample counts, or predefined files) as needed for direct reproduction of static dataset partitioning.
Hardware Specification	No	The paper discusses running on 'several distributed stream processing engines' and 'workload across several machines' but does not provide specific hardware details like CPU models, GPU types, or memory specifications used for experiments.
Software Dependencies	No	SAMOA is written in Java. The paper also mentions running on 'Storm, S4, and Samza' and providing 'connectors for MOA', but it does not give version numbers for any of these software dependencies.
Experiment Setup	Yes	The run command example includes parameters such as 'PrequentialEvaluation -l classifiers.ensemble.Bagging -s (ArffFileStream -f covtypeNorm.arff) -f 100000'. Additionally, for CluStream, it states: 'The period can be configured via a command line parameter (e.g., every 10 000 examples).'
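
Illustration: the Dataset Splits row notes that the paper evaluates prequentially rather than on fixed train/test partitions. In prequential (interleaved test-then-train) evaluation, each arriving example is first used to test the current model and only then used to train it, so no separate split is needed. The sketch below shows that loop in plain Java. It is not SAMOA's implementation and does not use SAMOA's API: the class PrequentialSketch, the toy MajorityClassLearner, and the sample label stream are all hypothetical names and data chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of prequential evaluation; not part of SAMOA. */
class PrequentialSketch {

    /** Toy incremental learner (assumption): predicts the most frequent label seen so far. */
    static final class MajorityClassLearner {
        // TreeMap keeps labels sorted, so ties resolve to the smallest label.
        private final Map<Integer, Integer> counts = new TreeMap<>();

        /** Returns the most frequent label so far, or -1 before any training. */
        int predict() {
            int best = -1;
            int bestCount = 0;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }

        /** Updates the label counts with one observed example. */
        void train(int label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    /** Test-then-train loop: every example is scored before the model learns from it. */
    static double prequentialAccuracy(int[] labels) {
        MajorityClassLearner learner = new MajorityClassLearner();
        int correct = 0;
        for (int label : labels) {
            if (learner.predict() == label) {
                correct++;            // test on the example first
            }
            learner.train(label);     // then use the same example for training
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] stream = {1, 1, 0, 1, 1, 0, 1, 1}; // made-up label stream
        System.out.println(prequentialAccuracy(stream)); // prints 0.625
    }
}
```

Because every example contributes to the accuracy estimate before it influences the model, the reported score depends on the order of the stream and on the evaluation parameters rather than on a split definition, which is why the run command in the Experiment Setup row is the relevant detail for reproduction.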