SAFE-NID: Self-Attention with Normalizing-Flow Encodings for Network Intrusion Detection

Authors: Brian Matejek, Ashish Gehani, Nathaniel D. Bastian, Daniel J. Clouse, Bradford J. Kline, Susmit Jha

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach by converting publicly available network flow-level intrusion datasets into packet-level ones. We release the labeled packet-level versions of these datasets with over 50 million packets each and describe the challenges in creating these datasets. We withhold from the training data certain attack categories to simulate zero-day attacks. Existing deep learning models, which achieve an accuracy of over 99% when detecting known attacks, only correctly classify 1% of the novel attacks. Our proposed transformer architecture with normalizing flows model safeguard achieves an area under the receiver operating characteristic curve of over 0.97 in detecting these novel inputs, outperforming existing combinations of neural architectures and model safeguards. The additional latency in processing each packet by the safeguard is a small fraction of the overall inference task.
Researcher Affiliation | Collaboration | Brian Matejek EMAIL, Computer Science Laboratory, SRI International; Ashish Gehani EMAIL, Computer Science Laboratory, SRI International; Nathaniel D. Bastian EMAIL, Army Cyber Institute, United States Military Academy; Daniel J. Clouse, Laboratory for Advanced Cybersecurity Research, Department of Defense; Bradford Kline, Laboratory for Advanced Cybersecurity Research, Department of Defense; Susmit Jha EMAIL, Computer Science Laboratory, SRI International
Pseudocode | No | The paper describes the technical approach and models in prose and mathematical equations (e.g., equations 1-5 for normalizing flows) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We make our processed packet-level dataset freely available and our code open source to encourage further research on packet-level network intrusion detection (https://github.com/SRI-CSL/trinity-packet).
Open Datasets | Yes | We convert publicly available network flow-level intrusion datasets into packet-level ones. We release the labeled packet-level versions of these datasets with over 50 million packets each and describe the challenges in creating these datasets. ... We create and release a packet-level network intrusion detection dataset extracted from the flow-level datasets (Sharafaldin et al., 2018b; Moustafa & Slay, 2015), and make these available to the wider research community. We publish the code for this process and resultant data files (https://github.com/SRI-CSL/trinity-packet).
Dataset Splits | Yes | We split our data into training/validation and testing sets. For the training and validation data, we take half of the malicious samples and an equal number of benign ones. Of this data, we use 75% for training and 25% for validation. We take ten random splits in this fashion and show averages and standard deviations for all published results. Our test data contains the remaining malicious and benign samples.
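The split procedure quoted above can be sketched as follows. This is a minimal illustration with synthetic labels; the array size, RNG seed, and function name are assumptions for demonstration, not details from the paper:

```python
import numpy as np

def make_split(labels, rng):
    """One random split following the quoted recipe: half of the
    malicious samples plus an equal number of benign samples form the
    train/validation pool (75% train / 25% validation); all remaining
    samples become the test set."""
    malicious = np.flatnonzero(labels == 1)
    benign = np.flatnonzero(labels == 0)
    rng.shuffle(malicious)
    rng.shuffle(benign)

    n_mal = len(malicious) // 2           # half of the malicious samples
    pool = np.concatenate([malicious[:n_mal], benign[:n_mal]])
    rng.shuffle(pool)

    n_train = int(0.75 * len(pool))       # 75% train, 25% validation
    train, val = pool[:n_train], pool[n_train:]
    test = np.setdiff1d(np.arange(len(labels)), pool)
    return train, val, test

# ten random splits, as in the paper; means and standard deviations
# would then be reported over the models trained on each split
rng = np.random.default_rng(0)            # seed is an assumption
labels = rng.integers(0, 2, size=1000)    # synthetic benign/malicious labels
splits = [make_split(labels, rng) for _ in range(10)]
train, val, test = splits[0]
```

Note that the test set is larger than the train/validation pool by construction, since it keeps the other half of the malicious samples plus all unused benign traffic.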
Hardware Specification | Yes | The timing analysis was run on an AMD Ryzen Threadripper PRO 5965WX 24-Cores processor at 1.8 GHz for Gaussian kernel density and on an NVIDIA RTX A6000 for normalizing flows.
Software Dependencies | No | The paper lists software libraries such as Python, PyTorch, Payload-Byte, the Framework for Easily Invertible Architectures (FrEIA), and the PyTorch Out-of-Distribution Detection Library, but does not provide specific version numbers for these components.
Experiment Setup | Yes | Our FNN and CNN baseline models each use batch normalization (Ioffe & Szegedy, 2015) and dropout (p = 0.2) (Srivastava et al., 2014) regularization techniques. Each hidden layer activation function is Leaky ReLU (α = 0.01). Both of the classification networks use the AMSGrad variant of the Adam optimizer (Kingma & Ba, 2014; Reddi et al., 2019) with β1 = 0.9, β2 = 0.999, and a learning rate of 1e-4. We use the binary cross-entropy loss function. We train each network for 20 epochs and use the weights with the lowest validation loss. For our transformer architecture, we use an embedding size of 384, used previously in sentence transformer sequence-to-classification tasks (Reimers & Gurevych, 2019). We only stack two transformer blocks, and each block has six self-attention heads with 64-dimensional key, value, and query vectors (Vaswani et al., 2017). Similar to the FNN and CNN models, we use batch normalization and dropout (p = 0.2) regularization for our fully connected layers after the transformer block. We use the AMSGrad variant of the Adam optimizer with β1 = 0.9, β2 = 0.999, and a learning rate of 3e-4. We use the binary cross-entropy loss function and train each network for six epochs. For our normalizing flow models, we stack 20 Real NVP blocks with an affine clamping α = 2. We learn our parameters for s and t using a simple fully connected network with two hidden layers of 128 features each and Leaky ReLU activation with α = 0.01. The input and output dimensions of these learnable blocks depend on the size of the extracted NN layers (either 256, 128, or 64 dimensions). We use the Adam optimizer with β1 = 0.8, β2 = 0.9, a learning rate of 1e-4, and weight decay of 2e-5. We train each normalizing flow model for 512 epochs. We use 25% of our in-distribution data as validation data, stratifying by packet category.
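The Real NVP coupling blocks with affine clamping described above can be sketched in plain NumPy. This is an illustration only: the paper's learned two-hidden-layer (128-unit) s/t networks are replaced by fixed random weights, and the (2α/π)·arctan soft-clamping form follows FrEIA-style coupling implementations rather than anything stated explicitly in the quote:

```python
import numpy as np

def soft_clamp(s, alpha=2.0):
    # soft clamping of the log-scale, bounding it to (-alpha, alpha);
    # the (2*alpha/pi)*arctan form is an FrEIA-style assumption
    return (2.0 * alpha / np.pi) * np.arctan(s)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

class AffineCoupling:
    """One Real NVP block: split the input in half and transform the
    second half with scale/translation predicted from the first half.
    Weights are random placeholders standing in for the trained s/t
    networks described in the paper."""
    def __init__(self, dim, hidden=128, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d = dim // 2
        # tiny two-layer MLP producing concatenated [s, t] for half the input
        self.w1 = rng.normal(scale=0.1, size=(d, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 2 * d))

    def _st(self, x1):
        h = leaky_relu(x1 @ self.w1)
        s, t = np.split(h @ self.w2, 2, axis=-1)
        return soft_clamp(s), t

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=-1)
        s, t = self._st(x1)
        return np.concatenate([x1, x2 * np.exp(s) + t], axis=-1)

    def inverse(self, y):
        y1, y2 = np.split(y, 2, axis=-1)
        s, t = self._st(y1)
        return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=-1)

# invertibility check on a 256-dimensional input (one of the quoted
# extracted-layer sizes); the paper stacks 20 such blocks
rng = np.random.default_rng(1)
block = AffineCoupling(256, rng=rng)
x = rng.normal(size=(4, 256))
x_rec = block.inverse(block.forward(x))
```

Keeping the first half untouched is what makes the block analytically invertible, and the clamping keeps exp(s) bounded (here within e^±2), which stabilizes training of deep flow stacks.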