Harry: A Tool for Measuring String Similarity
Authors: Konrad Rieck, Christian Wressnegger
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficiency of Harry in an empirical evaluation, where we first study its scalability (Section 3.1) and then compare its performance to related tools (Section 3.2). In our first experiment, we compute the Levenshtein (1966) distance... We repeat the computation with a different number of available CPU cores and measure the run-time in terms of comparisons per second. Figure 1 shows the results of this experiment. In the second experiment, we compare Harry with other tools for measuring string similarity. For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs. The averaged results for the comparative evaluation are shown in Figure 2. |
| Researcher Affiliation | Academia | Konrad Rieck (EMAIL), Christian Wressnegger (EMAIL), University of Göttingen, Goldschmidtstraße 7, 37077 Göttingen, Germany |
| Pseudocode | No | The paper describes methods and experiments but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The source code of Harry along with documentation and a tutorial is available at http://www.mlsec.org/harry. |
| Open Datasets | Yes | In all experiments we consider the data sets listed in Table 2 which contain strings of DNA snippets, protein sequences, Twitter messages and network traces, respectively. ARTS (Sonnenburg et al., 2006); SPROT (O'Donovan et al., 2002); TWEETS (Twitter.com); WEBFP (Cai et al., 2012). |
| Dataset Splits | No | The paper states: 'Each data set consists of 1,000 strings randomly drawn from the original source' and 'For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs.' While it mentions random drawing of strings, it does not specify explicit training/test/validation splits, percentages, or methodology for partitioning the data into distinct sets for model evaluation or training, as would typically be required for reproduction. |
| Hardware Specification | No | The paper mentions: 'Harry makes use of multi-threading and distributes the workload over multiple CPU cores (see option -n).' and 'We repeat the computation with a different number of available CPU cores'. However, it does not specify any particular CPU model, speed, memory, or other specific hardware components used in the experiments. |
| Software Dependencies | Yes | We consider the Python modules python-Levenshtein (0.11.2) and python-jellyfish (0.5.0) that implement the Levenshtein distance and its variants, the library CompLearn (1.1.7) that focuses on compression distances, and the machine learning toolbox Shogun (4.0.0) that provides several string kernels. |
| Experiment Setup | Yes | In our first experiment, we compute the Levenshtein (1966) distance, the normalized compression distance (Bennett et al., 1998) and the Subsequence kernel (Lodhi et al., 2002) on all four data sets using Harry. We repeat the computation with a different number of available CPU cores and measure the run-time in terms of comparisons per second. In the second experiment, we compare Harry with other tools for measuring string similarity. For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs. |
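The comparative experiment above computes a full pairwise Levenshtein distance matrix over a sample of strings. As a minimal sketch of that computation, the following uses a plain-Python dynamic-programming edit distance rather than Harry or the cited libraries (python-Levenshtein, jellyfish); the function and variable names are illustrative, not from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein, 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def distance_matrix(strings):
    """Full symmetric distance matrix, as in the comparative experiment."""
    n = len(strings)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(strings[i], strings[j])
            m[i][j] = m[j][i] = d
    return m
```

Harry parallelizes exactly this kind of all-pairs loop across CPU cores (its option `-n`); the sketch above is single-threaded and meant only to show what one "comparison" in the run-time measurements corresponds to.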