Harry: A Tool for Measuring String Similarity

Authors: Konrad Rieck, Christian Wressnegger

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We demonstrate the efficiency of Harry in an empirical evaluation, where we first study its scalability (Section 3.1) and then compare its performance to related tools (Section 3.2). In our first experiment, we compute the Levenshtein (1966) distance... We repeat the computation with a different number of available CPU cores and measure the run-time in terms of comparisons per second. Figure 1 shows the results of this experiment. In the second experiment, we compare Harry with other tools for measuring string similarity. For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs. The averaged results for the comparative evaluation are shown in Figure 2."
Researcher Affiliation: Academia. Konrad Rieck (EMAIL), Christian Wressnegger (EMAIL); University of Göttingen, Goldschmidtstraße 7, 37077 Göttingen, Germany.
Pseudocode: No. The paper describes methods and experiments but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes. "The source code of Harry along with documentation and a tutorial is available at http://www.mlsec.org/harry."
Open Datasets: Yes. "In all experiments we consider the data sets listed in Table 2 which contain strings of DNA snippets, protein sequences, Twitter messages and network traces, respectively." ARTS (Sonnenburg et al., 2006); SPROT (O'Donovan et al., 2002); TWEETS (Twitter.com); WEBFP (Cai et al., 2012).
Dataset Splits: No. The paper states: 'Each data set consists of 1,000 strings randomly drawn from the original source' and 'For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs.' While it mentions randomly drawn strings, it does not specify training/validation/test splits, percentages, or a methodology for partitioning the data into distinct sets, as would typically be required for reproduction.
Hardware Specification: No. The paper mentions: 'Harry makes use of multi-threading and distributes the workload over multiple CPU cores (see option -n).' and 'We repeat the computation with a different number of available CPU cores'. However, it does not specify the CPU model, clock speed, memory, or any other hardware details used in the experiments.
Software Dependencies: Yes. "We consider the Python modules python-Levenshtein (0.11.2) and python-jellyfish (0.5.0) that implement the Levenshtein distance and its variants, the library CompLearn (1.1.7) that focuses on compression distances, and the machine learning toolbox Shogun (4.0.0) that provides several string kernels."
Experiment Setup: Yes. "In our first experiment, we compute the Levenshtein (1966) distance, the normalized compression distance (Bennett et al., 1998) and the Subsequence kernel (Lodhi et al., 2002) on all four data sets using Harry. We repeat the computation with a different number of available CPU cores and measure the run-time in terms of comparisons per second. In the second experiment, we compare Harry with other tools for measuring string similarity. For each of the four data sets, we randomly draw 100 strings, compute a full similarity matrix with each tool and measure the run-time over 10 runs."
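The experimental protocol quoted above (a full pairwise similarity matrix, with run-time measured in comparisons per second and averaged over repeated runs) can be sketched in plain Python. This is a scaled-down illustration, not Harry's implementation or the paper's benchmark harness: the Levenshtein function below is a textbook dynamic-programming version, and the sample size and number of runs are reduced from the paper's 100 strings and 10 runs to keep the sketch fast.

```python
import random
import statistics
import string
import time

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein, 1966)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def full_matrix(strings):
    """Symmetric distance matrix over all pairs of strings."""
    n = len(strings)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = levenshtein(strings[i], strings[j])
    return d

# Scaled-down protocol: draw a random sample, time the full matrix
# over repeated runs, and report the averaged throughput.
random.seed(0)
pool = ["".join(random.choices(string.ascii_lowercase, k=30)) for _ in range(200)]
sample = random.sample(pool, 30)        # the paper draws 100 strings
pairs = len(sample) * (len(sample) - 1) // 2

runs = []
for _ in range(3):                      # the paper averages over 10 runs
    t0 = time.perf_counter()
    full_matrix(sample)
    runs.append(time.perf_counter() - t0)

mean_t = statistics.mean(runs)
print(f"{pairs} comparisons, mean run-time {mean_t:.4f}s "
      f"({pairs / mean_t:.0f} comparisons/s)")
```

Unlike this single-threaded sketch, Harry distributes the pairwise comparisons over multiple CPU cores (option -n), which is exactly what the scalability experiment varies.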