Memory-Efficient Sequential Pattern Mining with Hybrid Tries
Authors: Amin Hosseininasab, Willem-Jan van Hoeve, Andre A. Cire
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results on small to medium-sized real-life test instances show an average improvement of 85% in memory consumption and 49% in computation time compared to the state of the art. For large data sets, our algorithm stands out as the only capable SPM approach within 256GB of system memory, potentially saving 1.7TB in memory consumption. ... 4. Numerical Results. For our numerical tests, we evaluated the effect of different data set models on the performance of state-of-the-art mining algorithms in large-scale SPM. |
| Researcher Affiliation | Academia | Amin Hosseininasab (EMAIL), Warrington College of Business, University of Florida, Gainesville, FL, USA; Willem-Jan van Hoeve (EMAIL), Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA; Andre A. Cire (EMAIL), Rotman School of Management, University of Toronto, Toronto, ON, Canada |
| Pseudocode | Yes | Algorithm 1: Construction of BT. Algorithm 2: The BT Miner algorithm. Algorithm 3: Construction of HT. Algorithm 4: The HT Miner algorithm. |
| Open Source Code | Yes | Our algorithms are open source and available at https://github.com/aminhn/BDTrie |
| Open Datasets | Yes | We used six real-life data sets in our numerical tests, given in Table 2. ... The large data sets include the Criteo 1TB click-stream data set (Criteo AI Lab, 2014), and the amino acid representation of the 1000 Genomes data set (Consortium et al., 2015). ... The medium data sets include the Twitch data set (Rappaz et al., 2021), and the Spotify data set (Brost et al., 2019). The small data sets include Kosarak and MSNBC, which have been benchmark clickstream instances for SPM since the early 2010s (Fournier-Viger et al., 2016). |
| Dataset Splits | No | The paper uses subsets of data (e.g., '15% of the Criteo data set', '62% of the Genome data set') and applies various 'minimum support thresholds' for evaluation. However, it does not specify how these datasets are formally split into training, validation, or test sets for reproducibility purposes. |
| Hardware Specification | Yes | All algorithms were coded in C++ and executed on a PC with an Intel Xeon W-2255 processor, 256GB of memory, and Ubuntu 20.04.1 operating system. |
| Software Dependencies | No | All algorithms were coded in C++ and executed on a PC with an Intel Xeon W-2255 processor, 256GB of memory, and Ubuntu 20.04.1 operating system. The paper mentions the programming language (C++) and operating system (Ubuntu 20.04.1) but does not provide specific versions of libraries or other software components. |
| Experiment Setup | Yes | We limit all tests to one core of the CPU and a 36,000-second time limit. The thresholds and number of frequent patterns found for each data set are given in Table 3. |