PACKETCLIP: multi-modal embedding of network traffic and language for cybersecurity reasoning

Ryozo Masukawa¹, Sanggeon Yun¹, Sungheon Jeong¹, Wenjun Huang¹, Yang Ni¹, Ian Bryant¹, Nathaniel D Bastian², Mohsen Imani¹

Affiliations

¹ Department of Computer Science, University of California, Irvine, Irvine, CA, United States.
² Department of Electrical Engineering & Computer Science, United States Military Academy, West Point, NY, United States.

PMID: 40791311
PMCID: PMC12336109
DOI: 10.3389/frai.2025.1593944

Ryozo Masukawa et al. Front Artif Intell. 2025 Jul 28:8:1593944. doi: 10.3389/frai.2025.1593944. eCollection 2025.

Abstract

Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We introduce PACKETCLIP, a multi-modal framework that combines packet data with natural language semantics through contrastive pre-training and hierarchical Graph Neural Network (GNN) reasoning. PACKETCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, PACKETCLIP offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. With a 95% mean AUC, an 11.6% improvement over baselines, and a 92% reduction in intrusion detection training parameters, it is ideally suited for real-time anomaly detection. By bridging advanced machine-learning techniques and practical cybersecurity needs, PACKETCLIP provides a foundation for scalable, efficient, and interpretable solutions to encrypted traffic classification and network intrusion detection in resource-constrained environments.

Keywords: contrastive pre-training; graph neural network; machine learning; multimodal; reasoning.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Flowchart illustrating a system for detecting DoS attacks with Tiny AI. Steps include: a) defining the mission, b) using a knowledge graph for task abstraction, c) aligning packets and text using PacketCLIP, and d) performing hierarchical reasoning to detect DoS attacks with 92% confidence. e) Demonstrates fitting into low SWaP devices. Features highlighted are traceable intelligence, human-AI collaboration, interpretable reasoning, and Tiny AI compatibility. — **Figure 1**
Semantic AI framework for detecting traffic related to specific cyber-attacks defined by a user **(a)**, combining LLM-driven knowledge graphs **(b)**, PACKETCLIP alignment **(c)**, hierarchical reasoning **(d)**, and Tiny AI to enable efficient, interpretable, and traceable detection on low-resource devices **(e)**.

Flowchart illustrating a process for generating augmented text explanations for security data. It consists of five steps: 1) Security flow data collection, 2) Template-based text conversion, 3) Sampling words from a knowledge graph, 4) Large language model (LLM) paraphrasing, and 5) Creating augmented rich text explanations. The process starts with security data, which is converted into templates, enriched with sampled words, paraphrased by an LLM, and finally results in detailed text explanations on network security events. Arrows indicate the flow of information and interactions between components. — **Figure 2**
A framework to generate NL explanations for intrusion scenarios by mapping tabular security flow data (1) to text templates (2), leveraging LLM-generated knowledge graphs (3), utilizing LLMs for paraphrased explanations (4), and producing interpretable descriptions of network events (5).
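Step (2) of Figure 2, template-based text conversion, can be illustrated with a minimal sketch. The field names (`protocol`, `src_ip`, and so on) and the template wording are hypothetical and chosen only for illustration; the record above does not specify the actual columns or templates used.

```python
# Hypothetical template; the real templates are not given in this record.
TEMPLATE = (
    "A {protocol} flow from {src_ip}:{src_port} to {dst_ip}:{dst_port} "
    "carried {packets} packets and was labeled {label}."
)


def flow_to_text(flow, template=TEMPLATE):
    """Render one tabular flow record (a dict) as a plain-text sentence."""
    return template.format(**flow)


# Hypothetical flow record with made-up values.
example_flow = {
    "protocol": "TCP",
    "src_ip": "10.0.0.5",
    "src_port": 51234,
    "dst_ip": "10.0.0.9",
    "dst_port": 80,
    "packets": 120,
    "label": "SYN Flood",
}
sentence = flow_to_text(example_flow)
```

A sentence produced this way would then be enriched with sampled knowledge-graph terms and paraphrased by an LLM, as steps (3) and (4) of the figure describe.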

Text screenshot of cybersecurity task prompts and example outputs. The system prompt instructs explaining cybersecurity incidents. The user prompt requests a paraphrase in one sentence. Example outputs describe network traffic packets, with an anomalous packet for reconnaissance and a benign packet identified as non-malicious. — **Figure 3**
LLM prompt and sample outputs illustrating paraphrasing and concise explanation of cybersecurity network traffic incidents.

Diagram illustrating two parts: (a) Contrastive pretraining for PacketCLIP, showing a flowchart with packet data encoding and SSL head processing, and (b) Hierarchical GNN Training, displaying a network model with nodes for network access, credential theft, and embeddings, emphasizing hierarchical message passing. — **Figure 4**
**(a)** The overall architecture of the contrastive pre-training process for PACKETCLIP, including encoding packets and paired texts for learning. **(b)** A mission-specific hierarchical GNN framework that integrates PACKETCLIP with temporal models and classifiers to derive intrusion detection results.
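The contrastive pre-training in panel (a) can be sketched as a CLIP-style symmetric loss over paired packet and text embeddings. This is an assumption about the general technique, not the authors' exact objective: the function names, the temperature value of 0.07, and the omission of the SSL head are all illustrative choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(packet_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the packet-text similarity matrix.

    Row i of packet_emb is assumed to be paired with row i of text_emb.
    The temperature value is an assumed default, not taken from the source.
    """
    p = [l2_normalize(v) for v in packet_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(p)
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
              for pi in p]
    # Packet-to-text direction: each packet should pick its own text.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-packet direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2p = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_p2t + loss_t2p)
```

The loss is small when each packet embedding is closest to its own paired text and large when the pairs are mismatched, which is the alignment behavior the caption describes.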

Bar chart showing data proportions for different class labels except for benign. DNS Flood has the highest proportion at 2.5%, followed by Slowloris and SYN Flood. Other class labels include Dictionary Attack, Port Scan, and Vulnerability Scan, each with lower proportions. — **Figure 5**
Proportion of class labels.

Three panels represent different security challenges: DoS, Brute Force, and Reconnaissance. Each panel includes a word cloud and a bar chart. The DoS panel highlights terms like “botnet” and “mitigation” with frequency bars for terms like “protection” and “detection.” The Brute Force panel features “credential” and “stuffing,” with bars for “security” and “detection.” The Reconnaissance panel emphasizes “port” and “scanning,” with bars for “port” and “network.” — **Figure 6**
Word clouds and top 10 frequent vocabularies for DoS, Brute Force, and Reconnaissance missions from ACI-IoT-2023, highlighting key terms like “botnet,” “credential,” and “port scanning” for respective categories.

Line graph showing zero-shot accuracy vs. training steps for two encoders. Top-1 and top-5 accuracies are measured using SSL head. Both encoders show increasing accuracy with training, peaking around 80%–90%. The top-5 accuracy on the packet encoder dips significantly between 4,000 and 6,000 steps. — **Figure 7**
Zero-shot accuracy change during training shows a trade-off: SSL on both encoders improves faster but is less stable, while SSL only on the packet encoder progresses slower but is more stable.
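The top-1 and top-5 zero-shot accuracies plotted in Figure 7 can be computed along the following lines. This is a sketch that assumes CLIP-style zero-shot classification, where each packet embedding is compared with one text embedding per class; the function names and the input layout are illustrative, not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_topk_accuracy(packet_embs, labels, class_text_embs, k=1):
    """Fraction of packets whose true class is among the k most similar classes.

    labels[i] is the index into class_text_embs of the true class of packet i.
    """
    hits = 0
    for emb, label in zip(packet_embs, labels):
        scores = [cosine(emb, c) for c in class_text_embs]
        # Class indices ordered from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

With `k=1` a packet counts as correct only if its true class ranks first; with `k=5` any of the five most similar classes counts.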

Bar graphs comparing different models. In graph (a), ET-BERT has the highest FLOPs, followed by TFE-GNN, CLE-TFE, and GNN Reasoning, all with similar lower values. In graph (b), ET-BERT has the most parameters, significantly more than the other three models which have nearly equal and lower parameter counts, with a difference of 107 million highlighted. — **Figure 8**
**(a)** Comparison of models by FLOPs (×10⁶, log scale) and **(b)** by the number of parameters (M) for training.

Line chart comparing the performance of “PacketCLIP + GNN Reasoning” and “ET-BERT” models in terms of mean AUC against the portion of training data. “PacketCLIP + GNN Reasoning” maintains a steady performance around 96% mean AUC across data portions, while “ET-BERT” shows a decline from 63% at 100% data to below 40% at 30% data. A visual marker indicates a 50% difference at 30% data. — **Figure 9**
Robustness to data scarcity: mAUC of PACKETCLIP + GNN Reasoning compared with *ET-BERT* as the portion of training data varies (100%, 70%, 50%, 40%, 30%).
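The mean AUC (mAUC) metric used in Figure 9 can be sketched as an average of per-task ROC AUC values. How the paper groups tasks for averaging is not stated in this record, so the plain unweighted mean below is an assumption, and the function names are illustrative.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    labels holds 1 for positive (attack) and 0 for negative (benign).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(per_task):
    """Unweighted mean of AUCs over (scores, labels) pairs, one per task."""
    aucs = [roc_auc(scores, labels) for scores, labels in per_task]
    return sum(aucs) / len(aucs)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable on large test sets.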
