reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Using Meta-mining to Support Data Mining Workflow Planning and Optimization

Authors: P. Nguyen, M. Hilario, A. Kalousis

JAIR 2014

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the quality of the data mining workflows that the system produces on a collection of real world datasets coming from biology and show that it produces workflows that are significantly better than alternative methods that can only do workflow selection and not planning.
Researcher Affiliation	Academia	Phong Nguyen EMAIL, Melanie Hilario EMAIL, Department of Computer Science, University of Geneva, Switzerland; Alexandros Kalousis EMAIL, Department of Business Informatics, University of Applied Sciences Western Switzerland, and Department of Computer Science, University of Geneva, Switzerland
Pseudocode	No	The paper does not contain any sections explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present any structured, code-like algorithmic blocks.
Open Source Code	No	The paper mentions external tools and projects such as the RapidMiner platform (Klinkenberg, Mierswa, & Fischer, 2007) and the e-LICO project (http://www.e-lico.eu, http://www.e-lico.eu/eproplan.html), which are used in or related to the work. However, the authors make no explicit statement about releasing the source code for their methodology, and the paper gives no direct link to a code repository of their own.
Open Datasets	Yes	To construct the base-level experiments, we have collected 65 real world datasets on genomic microarray or proteomic data related to cancer diagnosis or prognosis, mostly from The National Center for Biotechnology Information (footnote 5)... Footnote 5: http://www.ncbi.nlm.nih.gov/
Dataset Splits	Yes	The performance measure we use is accuracy which we estimate using ten-fold cross-validation.
Hardware Specification	No	The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications. It only mentions the use of the RapidMiner DM suite for algorithm implementations.
Software Dependencies	Yes	Today's second generation knowledge discovery support systems (KDSS) allow complex modeling of workflows and contain several hundreds of operators; the RapidMiner platform (Klinkenberg, Mierswa, & Fischer, 2007), in its extended version with Weka (Hall et al., 2009) and R (R Core Team, 2013), proposes actually more than 500 operators... For all algorithms, we used the implementations provided in the RapidMiner DM suite (Klinkenberg et al., 2007).
Experiment Setup	Yes	When the planning goal g is the classification task, we will use as evaluation measure in our experiments the classification accuracy, estimated by ten-fold cross-validation, and do the significance testing using McNemar's test, with a significance level of 0.05... For the two meta-learning methods, we fixed the number N of nearest neighbors to five... For planning, we set manually the dataset kernel width parameter to τ x k = 0.04 and the workflow kernel width parameter to τ w k = 0.08...
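
For readers who want to retrace the significance testing quoted in the Experiment Setup row, the following is a minimal sketch of McNemar's test on paired predictions from two workflows, compared at the 0.05 level. It is an illustration, not the authors' code: the function name `mcnemar_exact`, the toy labels, and the choice of the exact binomial variant (rather than the chi-square approximation) are assumptions, since the quoted text does not say which variant was used.

```python
from math import comb

ALPHA = 0.05  # significance level quoted in the Experiment Setup row


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions (hypothetical helper).

    Returns (b, c, p_value), where b counts examples that only workflow A
    classifies correctly and c counts examples that only workflow B
    classifies correctly. Examples both get right or both get wrong are ignored.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        # The two workflows never disagree, so there is no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, each discordant pair favors A or B with probability 1/2.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up for illustration): predictions pooled over all test folds.
y_true = [1] * 10
pred_a = [1] * 10            # workflow A is right on all ten examples
pred_b = [0] * 8 + [1] * 2   # workflow B is wrong on the first eight

b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p, p < ALPHA)
```

With ten-fold cross-validation, one plausible way to apply this is to pool the held-out predictions of both workflows across the ten folds before calling the function, so that every example contributes exactly one paired outcome.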