NetSDM: Semantic Data Mining with Network Analysis

Authors: Jan Kralj, Marko Robnik-Šikonja, Nada Lavrač

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental evaluation of the NetSDM methodology on acute lymphoblastic leukemia and breast cancer data demonstrates that NetSDM achieves radical time efficiency improvements and that learned rules are comparable or better than the rules obtained by the original SDM algorithms.
Researcher Affiliation | Academia | Jan Kralj EMAIL, Jožef Stefan Institute, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia; Marko Robnik-Šikonja EMAIL, University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, 1000 Ljubljana, Slovenia; Nada Lavrač EMAIL, Jožef Stefan Institute, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia
Pseudocode | Yes | Algorithm 1: The NetSDM algorithm, implementing the proposed approach to semantic data mining with network node ranking and ontology shrinking. Algorithm 2: The algorithm for removing a node from a network, obtained through direct conversion of the background knowledge into the information network format.
Open Source Code | No | The paper does not provide concrete access to its own source code for the NetSDM methodology. It mentions the Aleph manual link, but this is for a third-party tool, not the authors' implementation for this paper.
Open Datasets | Yes | ALL (acute lymphoblastic leukemia) data. The ALL data set, introduced by Chiaretti et al. (2004), is a typical dataset for medical research. Breast cancer data. The breast cancer data set, introduced by Sotiriou et al. (2006), contains gene expression data on patients suffering from breast cancer. ...Gene Ontology (Ashburner et al., 2000), which was used as the background knowledge in our experiments.
Dataset Splits | No | The paper mentions subsets of genes (e.g., "1,000 enriched genes... from a set of 10,000 genes" and "990 interesting genes out of a total of 12,019 genes") and distinguishes between positive and negative examples, but it does not specify explicit training, validation, or test splits, split percentages, or a splitting methodology beyond these overall descriptions of the target sets.
Hardware Specification | Yes | We timed the algorithm on the ALL data set using different settings for the beam, depth and support on 8 core 2.60 GHz Intel Xeon(R) E5-2697 v3 machine with 64GB of RAM.
Software Dependencies | No | The paper mentions the use of the Hedwig and Aleph algorithms and refers to the 'Aleph Manual, 1999'. However, it does not provide specific version numbers for the implementations of these algorithms or for any other key software libraries used in the experiments.
Experiment Setup | Yes | Using Hedwig, we ran the algorithm with all combinations of depth (1 or 10), beam width (1 or 10) and support (0.1 or 0.01). For Aleph, we ran the algorithm using the settings recommended by the algorithm author: minimum number of positive examples covered by a rule was set to 10, and maximum number of negative examples covered by a rule was set to 100.
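
The Experiment Setup entry describes a small parameter grid for Hedwig: two values each for depth, beam width and support, giving eight combinations. As a rough illustration only, the sketch below enumerates that grid in Python. The dictionary keys (`depth`, `beam`, `support`, `min_positives`, `max_negatives`) and the function name are hypothetical labels chosen for this example; they are not taken from the Hedwig or Aleph interfaces, and the sketch does not run either tool.

```python
from itertools import product

# Values quoted in the Experiment Setup entry; the key names are
# illustrative assumptions, not actual Hedwig option names.
HEDWIG_GRID = {
    "depth": (1, 10),
    "beam": (1, 10),
    "support": (0.1, 0.01),
}

# Aleph thresholds quoted in the same entry; key names are hypothetical.
ALEPH_SETTINGS = {"min_positives": 10, "max_negatives": 100}


def hedwig_configurations(grid=HEDWIG_GRID):
    """Return every combination of the grid values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


if __name__ == "__main__":
    # Print each of the 2 x 2 x 2 settings, one per line.
    for config in hedwig_configurations():
        print(config)
```

Each dictionary returned by `hedwig_configurations()` corresponds to one of the timed runs described in the entry; how those values would be passed to an actual Hedwig installation is not stated in the source.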