Harry: A Tool for Measuring String Similarity

Authors: Konrad Rieck, Christian Wressnegger

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We demonstrate the efficiency of Harry in an empirical evaluation, where we first study its scalability (Section 3.1) and then compare its performance to related tools (Section 3.2). In our first experiment, we compute the Levenshtein (1966) distance... We repeat the computation with a different number of available CPU cores and measure the run-time in terms of comparisons per second. Figure 1 shows the results of this experiment. In the second experiment, we compare Harry with other tools for measuring string similarity. For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs. The averaged results for the comparative evaluation are shown in Figure 2."
Researcher Affiliation: Academia. Konrad Rieck (EMAIL), Christian Wressnegger (EMAIL); University of Göttingen, Goldschmidtstraße 7, 37077 Göttingen, Germany.
Pseudocode: No. The paper describes methods and experiments but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes. "The source code of Harry along with documentation and a tutorial is available at http://www.mlsec.org/harry."
Open Datasets: Yes. "In all experiments we consider the data sets listed in Table 2 which contain strings of DNA snippets, protein sequences, Twitter messages and network traces, respectively." ARTS (Sonnenburg et al., 2006); SPROT (O'Donovan et al., 2002); TWEETS (Twitter.com); WEBFP (Cai et al., 2012).
Dataset Splits: No. The paper states: 'Each data set consists of 1,000 strings randomly drawn from the original source' and 'For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs.' While it mentions randomly drawn strings, it does not specify training/validation/test splits, percentages, or a methodology for partitioning the data into distinct sets, as would typically be required for reproduction.
Hardware Specification: No. The paper mentions: 'Harry makes use of multi-threading and distributes the workload over multiple CPU cores (see option -n).' and 'We repeat the computation with a different number of available CPU cores'. However, it does not specify the CPU model, clock speed, memory, or any other hardware details used in the experiments.
Software Dependencies: Yes. "We consider the Python modules python-Levenshtein (0.11.2) and python-jellyfish (0.5.0) that implement the Levenshtein distance and its variants, the library CompLearn (1.1.7) that focuses on compression distances, and the machine learning toolbox Shogun (4.0.0) that provides several string kernels."
Experiment Setup: Yes. "In our first experiment, we compute the Levenshtein (1966) distance, the normalized compression distance (Bennett et al., 1998) and the Subsequence kernel (Lodhi et al., 2002) on all four data sets using Harry. We repeat the computation with a different number of available CPU cores and measure the run-time in terms of comparisons per second. In the second experiment, we compare Harry with other tools for measuring string similarity. For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs."
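The experimental protocol quoted above (a full pairwise similarity matrix, with run-time measured in comparisons per second and averaged over repeated runs) can be sketched in plain Python. This is a scaled-down illustration, not Harry's implementation or the paper's benchmark harness: the Levenshtein function below is a textbook dynamic-programming version, and the sample size and number of runs are reduced from the paper's 100 strings and 10 runs to keep the sketch fast.

```python
import random
import statistics
import string
import time

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein, 1966)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def full_matrix(strings):
    """Symmetric distance matrix over all pairs of strings."""
    n = len(strings)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = levenshtein(strings[i], strings[j])
    return d

# Scaled-down protocol: draw a random sample, time the full matrix
# over repeated runs, and report the averaged throughput.
random.seed(0)
pool = ["".join(random.choices(string.ascii_lowercase, k=30)) for _ in range(200)]
sample = random.sample(pool, 30)        # the paper draws 100 strings
pairs = len(sample) * (len(sample) - 1) // 2

runs = []
for _ in range(3):                      # the paper averages over 10 runs
    t0 = time.perf_counter()
    full_matrix(sample)
    runs.append(time.perf_counter() - t0)

mean_t = statistics.mean(runs)
print(f"{pairs} comparisons, mean run-time {mean_t:.4f}s "
      f"({pairs / mean_t:.0f} comparisons/s)")
```

Unlike this single-threaded sketch, Harry distributes the pairwise comparisons over multiple CPU cores (option -n), which is exactly what the scalability experiment varies.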