Memory-Efficient Sequential Pattern Mining with Hybrid Tries
Authors: Amin Hosseininasab, Willem-Jan van Hoeve, Andre A. Cire
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results on small to medium-sized real-life test instances show an average improvement of 85% in memory consumption and 49% in computation time compared to the state of the art. For large data sets, our algorithm stands out as the only capable SPM approach within 256GB of system memory, potentially saving 1.7TB in memory consumption. ... 4. Numerical Results. For our numerical tests, we evaluated the effect of different data set models on the performance of state-of-the-art mining algorithms in large-scale SPM. |
| Researcher Affiliation | Academia | Amin Hosseininasab (EMAIL), Warrington College of Business, University of Florida, Gainesville, FL, USA; Willem-Jan van Hoeve (EMAIL), Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA; Andre A. Cire (EMAIL), Rotman School of Management, University of Toronto, Toronto, ON, Canada |
| Pseudocode | Yes | Algorithm 1: Construction of BT. Algorithm 2: The BT Miner algorithm. Algorithm 3: Construction of HT. Algorithm 4: The HT Miner algorithm. |
| Open Source Code | Yes | Our algorithms are open source and available at https://github.com/aminhn/BDTrie |
| Open Datasets | Yes | We used six real-life data sets in our numerical tests, given in Table 2. ... The large data sets include the Criteo 1TB click-stream data set (Criteo AI Lab, 2014), and the amino acid representation of the 1000 Genomes data set (Consortium et al., 2015). ... The medium data sets include the Twitch data set (Rappaz et al., 2021), and the Spotify data set (Brost et al., 2019). The small data sets include Kosarak and MSNBC, which have been benchmark clickstream instances for SPM since the early 2010s (Fournier-Viger et al., 2016). |
| Dataset Splits | No | The paper uses subsets of data (e.g., '15% of the Criteo data set', '62% of the Genome data set') and applies various 'minimum support thresholds' for evaluation. However, it does not specify how these datasets are formally split into training, validation, or test sets for reproducibility purposes. |
| Hardware Specification | Yes | All algorithms were coded in C++ and executed on a PC with an Intel Xeon W-2255 processor, 256GB of memory, and Ubuntu 20.04.1 operating system. |
| Software Dependencies | No | All algorithms were coded in C++ and executed on a PC with an Intel Xeon W-2255 processor, 256GB of memory, and Ubuntu 20.04.1 operating system. The paper mentions the programming language (C++) and operating system (Ubuntu 20.04.1) but does not provide specific versions of libraries or other software components. |
| Experiment Setup | Yes | We limit all tests to one core of the CPU and a 36,000-second time limit. The thresholds and number of frequent patterns found for each data set are given in Table 3. |