Using Meta-mining to Support Data Mining Workflow Planning and Optimization

Authors: P. Nguyen, M. Hilario, A. Kalousis

JAIR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the quality of the data mining workflows that the system produces on a collection of real world datasets coming from biology and show that it produces workflows that are significantly better than alternative methods that can only do workflow selection and not planning.
Researcher Affiliation | Academia | Phong Nguyen (EMAIL), Melanie Hilario (EMAIL), Department of Computer Science, University of Geneva, Switzerland; Alexandros Kalousis (EMAIL), Department of Business Informatics, University of Applied Sciences Western Switzerland, and Department of Computer Science, University of Geneva, Switzerland
Pseudocode | No | The paper does not contain any sections explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present any structured, code-like algorithmic blocks.
Open Source Code | No | The paper mentions external tools and projects that are used by or related to the work, such as the RapidMiner platform (Klinkenberg, Mierswa, & Fischer, 2007) and the e-LICO project (http://www.e-lico.eu, http://www.e-lico.eu/eproplan.html). However, the authors do not explicitly state that the source code for their own methodology is publicly available, and they give no direct link to a code repository of their own.
Open Datasets | Yes | To construct the base-level experiments, we have collected 65 real world datasets on genomic microarray or proteomic data related to cancer diagnosis or prognosis, mostly from The National Center for Biotechnology Information (footnote 5)... Footnote 5: http://www.ncbi.nlm.nih.gov/
Dataset Splits | Yes | The performance measure we use is accuracy which we estimate using ten-fold cross-validation.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications. It only mentions the use of the RapidMiner DM suite for algorithm implementations.
Software Dependencies | Yes | Today's second generation knowledge discovery support systems (KDSS) allow complex modeling of workflows and contain several hundreds of operators; the RapidMiner platform (Klinkenberg, Mierswa, & Fischer, 2007), in its extended version with Weka (Hall et al., 2009) and R (R Core Team, 2013), proposes actually more than 500 operators... For all algorithms, we used the implementations provided in the RapidMiner DM suite (Klinkenberg et al., 2007).
Experiment Setup | Yes | When the planning goal g is the classification task, we will use as evaluation measure in our experiments the classification accuracy, estimated by ten-fold cross-validation, and do the significance testing using McNemar's test, with a significance level of 0.05... For the two meta-learning methods, we fixed the number N of nearest neighbors to five... For planning, we set manually the dataset kernel width parameter to τ x k = 0.04 and the workflow kernel width parameter to τ w k = 0.08...
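
The significance test named in the Experiment Setup entry can be illustrated with a minimal sketch. It assumes the exact (binomial) form of McNemar's test, applied to per-instance correctness flags pooled over the ten cross-validation folds; the paper excerpt does not say which variant of the test was used. The function name `mcnemar_exact` and the toy data are hypothetical and are not taken from the paper or from RapidMiner.

```python
from math import comb

ALPHA = 0.05  # significance level quoted in the Experiment Setup entry


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-instance correctness flags.

    Returns (b, c, p_value), where b counts instances only classifier A got
    right and c counts instances only classifier B got right.
    Assumption: exact binomial variant, not the chi-square approximation.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 20 instances both workflows classify correctly,
# 9 where only workflow A is correct, 1 where only workflow B is correct.
flags_a = [True] * 20 + [True] * 9 + [False] * 1
flags_b = [True] * 20 + [False] * 9 + [True] * 1

b, c, p = mcnemar_exact(flags_a, flags_b)
print(f"b={b}, c={c}, p={p:.4f}, significant={p < ALPHA}")
```

Only the discordant pairs enter the test, so two workflows with very different accuracies can still come out as not significantly different when few instances separate them.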