Learning to Identify Concise Regular Expressions that Describe Email Campaigns

Authors: Paul Prasse, Christoph Sawade, Niels Landwehr, Tobias Scheffer

JMLR 2015

Reproducibility assessment: each item lists the variable, the result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "We report on a case study conducted with an email service provider. We investigate whether postmasters accept the output of REx-SVM and REx-SVMshort for blacklisting mailing campaigns during regular operations of a commercial email service. We also evaluate how accurately REx-SVM and REx-SVMshort and their reference methods identify the extensions of mailing campaigns. Figure 7 shows the true- and false-positive rates for all methods on both data sets. The horizontal axis displays the number of emails in the input batch x. Error bars indicate the standard error."
Researcher Affiliation: Collaboration
LLM Response: "Paul Prasse (EMAIL), University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany; Christoph Sawade (EMAIL), SoundCloud Ltd., Rheinsberger Str. 76/77, 10115 Berlin, Germany; Niels Landwehr (EMAIL), University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany; Tobias Scheffer (EMAIL), University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany"
Pseudocode: Yes
LLM Response: "Algorithm 1 (Constructing the decoding space); Algorithm 2 (Most strongly violated constraint)"
Open Source Code: No
LLM Response: "The paper does not explicitly state that the source code for their methodology is made publicly available, nor does it provide a direct link to a code repository."
Open Datasets: Yes
LLM Response: "Additionally, we use a public data set that consists of 100 batches of emails extracted from the Bruce Guenther archive [1], containing a total of 63,512 emails. To measure false-positive rates on this public data set, we use 17,419 non-spam emails from the Enron corpus [2] and 76,466 non-spam emails of the TREC corpus [3]. The public data set is available to other researchers."
[1] http://untroubled.org/spam/
[2] http://www.cs.cmu.edu/~enron/
[3] http://trec.nist.gov/data/spam.html
Dataset Splits: Yes
LLM Response: "We first carry out a leave-one-batch-out cross-validation loop over the 158 labeled batches of the ESP data set. In each iteration, 157 batches are reserved for training of f_u. On this training portion of the data, regularization parameter C_u is tuned in a nested 10-fold cross-validation loop; then a model is trained on all 157 training batches."
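
The splitting protocol quoted above (an outer leave-one-batch-out loop with a nested 10-fold loop for tuning the regularization parameter C_u) can be sketched as follows. This is a minimal illustration of the evaluation scheme, not the paper's implementation; `train_model`, `evaluate`, and the parameter grid are hypothetical placeholders.

```python
def nested_leave_one_batch_out(batches, c_grid, train_model, evaluate):
    """Outer leave-one-batch-out CV with nested 10-fold tuning of C_u."""
    results = []
    for i in range(len(batches)):               # outer loop: hold out one batch
        held_out = batches[i]
        train = batches[:i] + batches[i + 1:]   # remaining batches for training
        # Inner loop: 10-fold CV over the training batches to select C_u.
        best_c, best_score = None, float("-inf")
        for c in c_grid:
            fold_scores = []
            for k in range(10):
                val = train[k::10]              # every 10th batch as validation
                if not val:
                    continue
                trn = [b for j, b in enumerate(train) if j % 10 != k]
                model = train_model(trn, c)
                fold_scores.append(evaluate(model, val))
            score = sum(fold_scores) / len(fold_scores)
            if score > best_score:
                best_c, best_score = c, score
        # Retrain on all training batches with the selected C_u,
        # then score on the held-out batch.
        final_model = train_model(train, best_c)
        results.append(evaluate(final_model, [held_out]))
    return results
```

With 158 labeled batches, each outer iteration trains on 157 batches, exactly as the quoted protocol describes.
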
Hardware Specification: No
LLM Response: "The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or memory specifications."
Software Dependencies: No
LLM Response: "The paper mentions several algorithms and methods, such as 'standard cutting plane methods as Pegasos (Shalev-Shwartz et al., 2011) or SVMstruct (Tsochantaridis et al., 2005)', and refers to the method of Dubé and Feeley (2000), but does not provide specific version numbers for any software libraries, frameworks, or tools used in the implementation."
Experiment Setup: Yes
LLM Response: "We train a first-stage model f_u on the 158 labeled batches after tuning regularization parameter C_u with 10-fold cross-validation. We tune the regularization parameter C_v using leave-one-out cross-validation and train a global model f_v that is used in the following experiments."
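
The second-stage setup quoted above (leave-one-out tuning of C_v followed by training a single global model f_v on all data) can be sketched similarly. Again, `train_model`, `evaluate`, and the grid are illustrative placeholders rather than the authors' code.

```python
def tune_and_train_global(examples, c_grid, train_model, evaluate):
    """Select C_v by leave-one-out CV, then train one global model f_v."""
    best_c, best_score = None, float("-inf")
    for c in c_grid:
        scores = []
        for i, held_out in enumerate(examples):  # leave-one-out loop
            rest = examples[:i] + examples[i + 1:]
            model = train_model(rest, c)
            scores.append(evaluate(model, held_out))
        score = sum(scores) / len(scores)
        if score > best_score:
            best_c, best_score = c, score
    # Train the global model on all examples with the selected C_v.
    return train_model(examples, best_c)
```

The returned model corresponds to the global f_v that the quoted setup reuses in the subsequent experiments.
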