Entropy (Basel). 2020 Jun 12;22(6):649. doi: 10.3390/e22060649.

Anomaly Detection for Individual Sequences with Applications in Identifying Malicious Tools

Shachar Siboni et al.

Abstract

Anomaly detection refers to the problem of identifying abnormal behaviour within a set of measurements. In many cases, one has a statistical model for normal data and wishes to identify whether new data fit the model or not. In other cases, however, while there are normal data to learn from, there is no statistical model for these data and no structured parameter set to estimate. Thus, one is forced to assume an individual-sequences setup, where there is no given model and no guarantee that such a model exists. In this work, we propose a universal anomaly detection algorithm for one-dimensional time series that learns the normal behaviour of systems and alerts on abnormalities, without assuming anything about either the normal data or the anomalies. The suggested method utilizes new information measures derived from the Lempel-Ziv (LZ) compression algorithm to optimally and efficiently learn the normal behaviour during training, and then to estimate the likelihood of new data during operation and classify it accordingly. We apply the algorithm to key problems in computer security, as well as to a benchmark anomaly detection data set, all using simple, single-feature, time-indexed data. The first is detecting botnet Command and Control (C&C) channels without deep packet inspection. We then apply it to the problems of malicious-tool detection via system-call monitoring and of data-leakage identification. We conclude with the New York City (NYC) taxi data. Finally, using information-theoretic tools, we show that an attacker's attempt to fool the detection system by generating seemingly normal data is bound to fail, either due to a high probability of error or because of the need for huge amounts of resources.
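To make the approach concrete, the following is a minimal Python sketch of an LZ78-style probability assignment used for anomaly detection. The class and function names, the counter-update rule, and the escape smoothing are illustrative assumptions, not the authors' exact construction.

```python
import math

class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.count = 0       # how many parsed phrases passed through this node

def train(sequence):
    """Build an LZ78 parse tree with counters from a 'normal' training sequence."""
    root = Node()
    node = root
    for symbol in sequence:
        child = node.children.get(symbol)
        if child is None:
            child = Node()
            node.children[symbol] = child
            child.count += 1
            node = root          # phrase complete: restart parsing at the root
        else:
            child.count += 1
            node = child
    return root

def log_probability(root, sequence):
    """Assign a log-probability to a new sequence by walking the tree."""
    logp = 0.0
    node = root
    for symbol in sequence:
        # Reserve one unit of mass for a single escape symbol (a simplification).
        total = sum(c.count for c in node.children.values()) + len(node.children) + 1
        child = node.children.get(symbol)
        if child is None:
            logp += math.log(1.0 / total)   # unseen continuation: escape mass
            node = root
        else:
            logp += math.log((child.count + 1) / total)
            node = child
    return logp

def is_anomalous(root, sequence, threshold):
    """Flag sequences whose per-symbol log-likelihood falls below a threshold."""
    return log_probability(root, sequence) / len(sequence) < threshold
```

In this sketch, sequences that the training tree compresses well receive high probability, while sequences with many unseen continuations accumulate escape penalties and fall below the threshold.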

Keywords: NYC taxi data; anomaly detection; botnets; command and control channels; computer security; individual sequences; learning; one-dimensional time series; probability assignment; statistical model; universal compression.

Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figures

Figure 1
A statistical model for the sequence “aabdbbacbbda”. Each node in the tree is represented by the 3-tuple {symbol, counter, probability}. The probabilities of edges connected directly to the root equal the corresponding root child’s counter divided by the total number of leaf nodes, i, at each step of the algorithm.
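As a rough illustration of Figure 1, the snippet below (reusing the hypothetical train() helper from the sketch above) rebuilds the parse tree for the caption's example string and prints the root-level 3-tuples. Note that it normalizes by the sum of the root children's counters, whereas the caption divides by the number of leaf nodes, i.

```python
# Illustrative only: parse the caption's example string and print the
# {symbol, counter, probability} 3-tuple for each child of the root.
root = train("aabdbbacbbda")
total = sum(child.count for child in root.children.values())
for sym, child in sorted(root.children.items()):
    print({"symbol": sym, "counter": child.count,
           "probability": child.count / total})
```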
Figure 2
A classification model based on the LZ78 universal compression algorithm and its associated probability assignment.
Figure 3
Testing ‘Majority Vote Classification’ using the TD feature and the ‘Uniform’ quantization method, considering ‘Clients’-type flows. (Left): Receiver Operating Characteristic (ROC) curve; (Right): zoom-in on the upper-left corner. Each testing sequence is first partitioned into several sets of subsequences, denoted as #Subseq in the graph, and the decision is made per set of subsequences; better results were achieved for a larger number of subsequences.
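A hedged sketch of the majority-vote rule described in Figure 3, assuming the log_probability() helper from the earlier sketch; subseq_len and threshold are illustrative parameters, and the paper's exact partitioning may differ.

```python
def majority_vote(root, sequence, subseq_len, threshold):
    """Anomalous if most fixed-length subsequences score below the threshold."""
    subseqs = [sequence[i:i + subseq_len]
               for i in range(0, len(sequence) - subseq_len + 1, subseq_len)]
    votes = sum(1 for s in subseqs
                if log_probability(root, s) / len(s) < threshold)
    return votes > len(subseqs) / 2   # the subsequences vote on the label
```

Voting over several short subsequences smooths out local noise, which is consistent with the caption's observation that more subsequences per set improved the results.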
Figure 4
(Left) Testing the training modes: Semisupervised-Negative, Semisupervised-Positive, and Unsupervised, using the TD feature and the ‘Uniform’ quantization method with respect to ‘Hosts’-type flows. The classifier achieves its best results, AUC = 0.998 with 100% detection and 3.51641% false alarms, in the Semisupervised-Negative training mode, and its worst results, AUC = 0.219 with ∼98% false alarms at 100% detection, in the Semisupervised-Positive training mode. (Right) The effect of the number of quantization levels on performance. QL refers to the number of centroids used. QL = 10 achieved the best results in terms of the area under the curve, with AUC = 0.992, as depicted by the blue line in the figure.
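Since LZ78 parsing needs a finite alphabet, continuous time-difference values must first be quantized. Below is a minimal sketch of the ‘Uniform’ variant, assuming QL equal-width bins over the observed range; the centroid-based variant in the figure would use clustering (e.g., k-means) instead, and all names are illustrative.

```python
def uniform_quantize(values, ql):
    """Map continuous values to the symbols 0..ql-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / ql or 1.0        # guard against a constant sequence
    return [min(int((v - lo) / width), ql - 1) for v in values]
```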
Figure 5
A simple example of time differences (TDi) for both normal and anomalous sequences. In this simple example, the normal traffic is characterized by variable values, which reflect standard network traffic, for example, a user surfing the web. The anomalous data are characterized by fixed values. This reflects the behaviour of a simple C&C channel, where the bots connect at specific times, for a specific time frame. For more complex anomalous behaviour, see also Section 6.
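The single feature here is the inter-arrival time difference, TD_i = t_{i+1} - t_i, computed from packet timestamps, as in this one-line sketch:

```python
def time_differences(timestamps):
    """TD_i = t_{i+1} - t_i for consecutive packet timestamps."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
```

A near-constant TD stream is exactly the signature of the naive periodic C&C beaconing described in the caption.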
Figure 6
Threshold analysis. (Left) Probabilities of training sequences and the thresholds obtained from the testing phase, referred to as Tr (in blue and orange). The x-axis refers to sequence numbers, while the y-axis depicts the probability on a log10 scale. The threshold levels decided with labeled data lie between the μ+σ and μ+2σ levels (light blue and green), where μ and σ are the mean and standard deviation of the probabilities from training alone. (Right) Histogram of the differences between the probabilities of training sequences and the threshold obtained from the testing phase (marked by the red line at 0). A false alarm rate of 5.101% is obtained using this threshold.
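A minimal sketch of the threshold rule reported in Figure 6: compute per-sequence log10 probabilities on training data alone and place the threshold between the μ+σ and μ+2σ levels. The tuning factor k is an assumed knob, not a value from the paper.

```python
import statistics

def pick_threshold(training_log10_probs, k=1.5):
    """Threshold at mu + k*sigma of the training log10-probabilities (assumed rule)."""
    mu = statistics.mean(training_log10_probs)
    sigma = statistics.stdev(training_log10_probs)
    return mu + k * sigma
```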
Figure 7
KL distances between the learned histogram of normal behaviour of firefox.exe and the histograms created every two minutes in the testing phase of the same process, as a function of time. The two gray vertical lines mark the time when “Zeus” was active.
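A minimal sketch of the comparison behind Figure 7, assuming system-call histograms represented as dicts mapping an event to its probability; the epsilon smoothing for unseen events is an assumption.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) for two histograms given as dicts: event -> probability."""
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
               for k in keys)
```

Rebuilding the second histogram every two minutes from fresh observations and tracking the divergence over time yields a curve like Figure 7, with spikes in the window where “Zeus” was active.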
Figure 8
Anomaly detection results for the benchmark file nyc_taxi.csv. Blue bars show the number of taxi passengers every 30 min throughout a 6-month period. The data include five known-cause anomalies: the NYC marathon (2 November 2014), Thanksgiving (27 November 2014), Christmas (25 December 2014), New Year's Day (1 January 2015), and a strong New England blizzard (27 January 2015). All five were correctly identified (orange), together with four false alarms (gray).
