Learning to Identify Concise Regular Expressions that Describe Email Campaigns
Authors: Paul Prasse, Christoph Sawade, Niels Landwehr, Tobias Scheffer
JMLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report on a case study conducted with an email service provider. We investigate whether postmasters accept the output of REx-SVM and REx-SVMshort for blacklisting mailing campaigns during regular operations of a commercial email service. We also evaluate how accurately REx-SVM and REx-SVMshort and their reference methods identify the extensions of mailing campaigns. Figure 7 shows the true- and false-positive rates for all methods on both data sets. The horizontal axis displays the number of emails in the input batch x. Error bars indicate the standard error. |
| Researcher Affiliation | Collaboration | Paul Prasse (University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany); Christoph Sawade (SoundCloud Ltd., Rheinsberger Str. 76/77, 10115 Berlin, Germany); Niels Landwehr (University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany); Tobias Scheffer (University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany) |
| Pseudocode | Yes | Algorithm 1 Constructing the decoding space Algorithm 2 Most strongly violated constraint |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is made publicly available, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Additionally, we use a public data set that consists of 100 batches of emails extracted from the Bruce Guenther archive [1], containing a total of 63,512 emails. To measure false-positive rates on this public data set, we use 17,419 non-spam emails from the Enron corpus [2] and 76,466 non-spam emails of the TREC corpus [3]. The public data set is available to other researchers. [1] http://untroubled.org/spam/ [2] http://www.cs.cmu.edu/~enron/ [3] http://trec.nist.gov/data/spam.html |
| Dataset Splits | Yes | We first carry out a leave-one-batch-out cross-validation loop over the 158 labeled batches of the ESP data set. In each iteration, 157 batches are reserved for training of fu. On this training portion of the data, regularization parameter Cu is tuned in a nested 10-fold cross validation loop, then a model is trained on all 157 training batches. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions several algorithms and methods like 'standard cutting plane methods as Pegasos (Shalev-Shwartz et al., 2011) or SVMstruct (Tsochantaridis et al., 2005)', and refers to 'the method of Dubé and Feeley (2000)', but does not provide specific version numbers for any software libraries, frameworks, or tools used in their implementation. |
| Experiment Setup | Yes | We train a first-stage model fu on the 158 labeled batches after tuning regularization parameter Cu with 10-fold cross validation. We tune the regularization parameter Cv using leave-one-out cross validation and train a global model fv that is used in the following experiments. |
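The evaluation protocol quoted in the "Dataset Splits" and "Experiment Setup" rows, leave-one-batch-out cross-validation with the regularization parameter tuned in a nested 10-fold loop, can be sketched as follows. This is a minimal illustration of the loop structure only: `train_model`, `evaluate`, and the candidate grid for C are hypothetical stand-ins, not the paper's actual implementation of f_u.

```python
def train_model(train_batches, C):
    """Hypothetical stand-in for fitting the first-stage model f_u."""
    return {"C": C, "n_train": len(train_batches)}

def evaluate(model, batch):
    """Hypothetical validation score; for demonstration it peaks at C = 1.0."""
    return 1.0 / (1.0 + abs(model["C"] - 1.0))

def tune_C(train_batches, candidate_Cs, n_folds=10):
    """Inner loop: choose C by n_folds-fold cross-validation over the training batches."""
    best_C, best_score = None, float("-inf")
    for C in candidate_Cs:
        fold_scores = []
        for k in range(n_folds):
            val = train_batches[k::n_folds]               # every n_folds-th batch as validation
            train = [b for b in train_batches if b not in val]
            model = train_model(train, C)
            fold_scores.extend(evaluate(model, b) for b in val)
        score = sum(fold_scores) / len(fold_scores)
        if score > best_score:
            best_C, best_score = C, score
    return best_C

def leave_one_batch_out(batches, candidate_Cs):
    """Outer loop: each batch is held out once; C is tuned on the remaining batches."""
    results = []
    for i in range(len(batches)):
        train = batches[:i] + batches[i + 1:]             # e.g. 157 of the 158 ESP batches
        best_C = tune_C(train, candidate_Cs)
        model = train_model(train, best_C)
        results.append((i, best_C, evaluate(model, batches[i])))
    return results
```

The outer loop mirrors the quoted protocol (157 training batches per iteration on the 158-batch ESP data set); the inner loop shows where Cu would be tuned before retraining on the full training portion.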