River: machine learning for streaming data in Python

Authors: Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, Albert Bifet

JMLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental We benchmark the implementation of three algorithms available in scikit-learn (Pedregosa et al., 2011), Creme and scikit-multiflow: Gaussian Naive Bayes (GNB), Logistic Regression (LR) (Hastie et al., 2009), and Hoeffding Tree (HT) (Hulten et al., 2001). Table 1 shows similar accuracy between implementations (as expected) for all models. Table 2 shows the processing time (learn and predict). River models perform at least as fast as, and overall faster than, the rest. Tests are performed on the Elec2 data set (Harries and Wales, 1999), which has 45,312 samples with 8 numerical features. Reported processing time is the average of running the experiment 7 times on a system with a 2.4 GHz Quad-Core Intel Core i5 processor and 16 GB of RAM.
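The benchmark protocol quoted above (test-then-train over a stream, with timings averaged over 7 repetitions) can be sketched in plain Python. The synthetic stream and the running-majority classifier below are illustrative stand-ins under stated assumptions, not River's implementations or the Elec2 data:

```python
import random
import statistics
import time
from collections import Counter

def make_stream(n=1000, seed=42):
    """Toy two-class stream standing in for Elec2 (label correlates with x)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.random()
        yield {"x": x}, int(x > 0.5)

class MajorityClass:
    """Minimal online learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn_one(self, x, y):
        self.counts[y] += 1

def prequential_run(model_factory, n=1000):
    """Test-then-train pass: predict each sample before learning from it."""
    model = model_factory()
    correct = total = 0
    start = time.perf_counter()
    for x, y in make_stream(n):
        correct += model.predict_one(x) == y
        total += 1
        model.learn_one(x, y)
    return correct / total, time.perf_counter() - start

# Average processing time over 7 repetitions, as in the paper's protocol.
times = [prequential_run(MajorityClass)[1] for _ in range(7)]
accuracy, _ = prequential_run(MajorityClass)
print(f"accuracy={accuracy:.3f}, mean time={statistics.mean(times):.4f}s")
```

The single pass both evaluates and trains, which is why streaming benchmarks report processing time rather than epochs; accuracy here hovers near chance because the stand-in model ignores the features.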
Researcher Affiliation Collaboration Jacob Montiel (AI Institute, University of Waikato, Hamilton, New Zealand; LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France); Max Halford (Alan, Paris, France); Saulo Martiello Mastelini (Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil); Geoffrey Bolmier (Volvo Car Corporation, Göteborg, Sweden); Raphael Sourty (IRIT, Université Paul Sabatier, Toulouse, France; Renault, Paris, France); Robin Vaysse (IRIT, Université Paul Sabatier, Toulouse, France; Octogone Lordat, Université Jean-Jaurès, Toulouse, France); Adil Zouitine (IRT Saint Exupéry, Toulouse, France); Heitor Murilo Gomes (AI Institute, University of Waikato, Hamilton, New Zealand); Jesse Read (LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France); Talel Abdessalem (LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France); Albert Bifet (AI Institute, University of Waikato, Hamilton, New Zealand; LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France)
Pseudocode No The paper includes code examples for demonstrating the library's usage, but it does not contain structured pseudocode or algorithm blocks (e.g., a section explicitly labeled 'Algorithm' or 'Pseudocode').
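For context, the usage examples in the paper follow River's incremental interface, in which each sample is a plain dict and models expose `learn_one`/`predict_one`. The tiny running-mean regressor below only mimics that interface as a minimal sketch; it is not part of River:

```python
class RunningMeanRegressor:
    """Toy model with a River-style interface: consumes one dict sample at a time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x: dict, y: float) -> "RunningMeanRegressor":
        # Incremental mean update; no samples are stored, as in streaming learning.
        self.n += 1
        self.mean += (y - self.mean) / self.n
        return self

    def predict_one(self, x: dict) -> float:
        return self.mean

model = RunningMeanRegressor()
stream = [({"f": 1.0}, 10.0), ({"f": 2.0}, 20.0), ({"f": 3.0}, 30.0)]
for x, y in stream:
    model.learn_one(x, y)
print(model.predict_one({"f": 4.0}))  # prints 20.0, the running mean of targets
```

The one-sample-at-a-time contract is what distinguishes this style of library from batch APIs such as scikit-learn's `fit`/`predict`.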
Open Source Code Yes The source code is available at https://github.com/online-ml/river.
Open Datasets Yes Tests are performed on the Elec2 data set (Harries and Wales, 1999) which has 45312 samples with 8 numerical features.
Dataset Splits No The paper mentions the Elec2 dataset and its size (45312 samples), but it does not provide specific details on how this dataset was split into training, validation, or test sets.
Hardware Specification Yes Reported processing time is the average of running the experiment 7 times on a system with a 2.4 GHz Quad-Core Intel Core i5 processor and 16GB of RAM.
Software Dependencies No The paper mentions Python and Cython as implementation languages and refers to scikit-learn, Creme, scikit-multiflow, and pandas.DataFrame, but it does not specify explicit version numbers for these software components used in the experiments.
Experiment Setup No The paper describes benchmarking algorithms and reporting their accuracy and processing time. However, it does not explicitly provide specific experimental setup details such as hyperparameters (e.g., learning rate, regularization strength, tree depth) or other training configurations for the algorithms benchmarked (GNB, LR, HT).