A Scalable and Efficient Iterative Method for Copying Machine Learning Classifiers

Authors: Nahuel Statuto, Irene Unceta, Jordi Nin, Oriol Pujol

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of the sequential approach is demonstrated through experiments with synthetic and real-world datasets, showing significant reductions in time and resources, while maintaining or improving accuracy. We present the results of our experimental study on the sequential copy approach applied to a set of heterogeneous problems. Our results are analyzed using various performance metrics and compared to the single-pass approach in both one-shot and online settings.
Researcher Affiliation | Academia | Nahuel Statuto EMAIL, Department of Operations, Innovation and Data Sciences, Universitat Ramon Llull, ESADE, Sant Cugat del Vallès, 08172, Catalonia, Spain; Irene Unceta EMAIL, Department of Operations, Innovation and Data Sciences, Universitat Ramon Llull, ESADE, Sant Cugat del Vallès, 08172, Catalonia, Spain; Jordi Nin EMAIL, Department of Operations, Innovation and Data Sciences, Universitat Ramon Llull, ESADE, Sant Cugat del Vallès, 08172, Catalonia, Spain; Oriol Pujol EMAIL, Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, 08007, Catalonia, Spain
Pseudocode | Yes | Algorithm 1 Preliminary version of the sequential approach...; Algorithm 2 Sample selection policy...; Algorithm 3 Empirical risk minimization implementation...; Algorithm 4 Memory aware empirical risk minimization implementation...; Algorithm 5 Sequential approach with alternating optimization
Open Source Code | No | The paper does not provide any explicit statement about the release of source code, nor does it include a link to a code repository.
Open Datasets | Yes | We use 58 datasets from the UCI Machine Learning Repository database (Dheeru and Karra Taniskidou, 2017) and follow the experimental methodology outlined in (Unceta et al., 2020a).
Dataset Splits | Yes | We split the pre-processed data into stratified 80/20 training and test sets.
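The stratified 80/20 split quoted above can be sketched as follows. The synthetic data and the scikit-learn call are illustrative stand-ins, not the paper's actual pre-processing pipeline or UCI data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a pre-processed dataset: 100 samples, 4 features,
# balanced binary labels (NOT the UCI data used in the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.repeat([0, 1], 50)

# Stratified 80/20 split: stratify=y preserves the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

With balanced labels, stratification guarantees both splits keep the 50/50 class ratio, which matters for the small test sets that an 80/20 split produces on modest UCI datasets.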
Hardware Specification | Yes | We perform all experiments on a server with 28 dual-core AMD EPYC 7453 at 2.75 GHz, and equipped with 768 GB RDIMM/3200 of RAM. The server runs on Linux 5.4.0.
Software Dependencies | Yes | We implement all experiments in Python 3.10.4 and train copies using TensorFlow 2.8.
Experiment Setup | Yes | We train these models sequentially for 30 iterations. At each iteration t, we generate n = 100 new data points by randomly sampling a standard normal distribution N(0, 1). We use Algorithm 5, discarding any data points for which the instantaneous copy model has an uncertainty ρ below a defined threshold δ. We adjust the weights at each iteration using the Adam optimizer with a learning rate of 5×10⁻⁴. For each value of t, we use 1000 epochs with balanced batches of 32 data points. We use the previously defined normalized uncertainty average as the loss function and evaluate the impact of the δ parameter by running independent trials for δ ∈ {5×10⁻⁴, 10⁻⁴, 5×10⁻⁵, 10⁻⁵, 5×10⁻⁶, 10⁻⁶, 5×10⁻⁷, 10⁻⁷, 5×10⁻⁸, 10⁻⁸, 10⁻⁹, 10⁻¹⁰}. Additionally, we allow the λ parameter to be updated automatically, starting from a value of 0.5.
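The per-iteration sampling-and-filtering step described above can be sketched as follows. The uncertainty measure (here taken as 1 minus the maximum predicted class probability) and the toy softmax classifier are illustrative assumptions, not the ρ defined in the paper; the Adam training loop over 1000 epochs is omitted.

```python
import numpy as np

def generate_filtered_samples(predict_proba, n=100, d=2, delta=5e-4, seed=0):
    """Draw n candidate points from N(0, 1) and discard those whose
    uncertainty rho falls below the threshold delta.

    `predict_proba` stands in for the instantaneous copy model; the choice
    rho = 1 - max class probability is an illustrative assumption.
    """
    rng = rng_for(seed)
    z = rng.standard_normal((n, d))           # candidates from N(0, 1)
    rho = 1.0 - predict_proba(z).max(axis=1)  # per-point uncertainty
    return z[rho >= delta]                    # keep only uncertain points

def rng_for(seed):
    return np.random.default_rng(seed)

# Toy stand-in copy model: softmax over two opposite linear scores, so points
# near the hyperplane x0 = 0 are maximally uncertain.
def toy_predict_proba(z):
    scores = np.stack([z[:, 0], -z[:, 0]], axis=1)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

kept = generate_filtered_samples(toy_predict_proba, n=100, d=2, delta=5e-4)
```

The filtering concentrates each iteration's synthetic sample on regions where the current copy is still uncertain, which is what makes the per-iteration budget of n = 100 points effective.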