Entropy (Basel). 2020 Jun 12;22(6):649. doi: 10.3390/e22060649.

Anomaly Detection for Individual Sequences with Applications in Identifying Malicious Tools

Shachar Siboni et al.

Abstract

Anomaly detection refers to the problem of identifying abnormal behaviour within a set of measurements. In many cases, one has a statistical model for normal data and wishes to identify whether new data fit the model or not. In other cases, however, while there are normal data to learn from, there is no statistical model for these data and no structured parameter set to estimate. Thus, one is forced to assume an individual-sequences setup, where there is no given model and no guarantee that such a model exists. In this work, we propose a universal anomaly detection algorithm for one-dimensional time series that learns the normal behaviour of systems and alerts on abnormalities, without assuming anything about either the normal data or the anomalies. The suggested method utilizes new information measures derived from the Lempel-Ziv (LZ) compression algorithm to optimally and efficiently learn the normal behaviour during training, and then to estimate the likelihood of new data during operation and classify it accordingly. We apply the algorithm to key problems in computer security, as well as to a benchmark anomaly detection data set, all using simple, single-feature, time-indexed data. The first is detecting botnet Command and Control (C&C) channels without deep packet inspection. We then apply it to the problems of malicious-tool detection via system-call monitoring and of data-leakage identification. We conclude with the New York City (NYC) taxi data. Finally, using information-theoretic tools, we show that an attacker's attempt to fool the detection system by generating seemingly normal data is bound to fail, either due to a high probability of error or because of the need for huge amounts of resources.
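To make the approach concrete, the following is a minimal Python sketch of an LZ78-style probability assignment used for anomaly detection. The class and function names, the counter-update rule, and the escape smoothing are illustrative assumptions, not the authors' exact construction.

```python
import math

class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.count = 0       # how many parsed phrases passed through this node

def train(sequence):
    """Build an LZ78 parse tree with counters from a 'normal' training sequence."""
    root = Node()
    node = root
    for symbol in sequence:
        child = node.children.get(symbol)
        if child is None:
            child = Node()
            node.children[symbol] = child
            child.count += 1
            node = root          # phrase complete: restart parsing at the root
        else:
            child.count += 1
            node = child
    return root

def log_probability(root, sequence):
    """Assign a log-probability to a new sequence by walking the tree."""
    logp = 0.0
    node = root
    for symbol in sequence:
        # Reserve one unit of mass for a single escape symbol (a simplification).
        total = sum(c.count for c in node.children.values()) + len(node.children) + 1
        child = node.children.get(symbol)
        if child is None:
            logp += math.log(1.0 / total)   # unseen continuation: escape mass
            node = root
        else:
            logp += math.log((child.count + 1) / total)
            node = child
    return logp

def is_anomalous(root, sequence, threshold):
    """Flag sequences whose per-symbol log-likelihood falls below a threshold."""
    return log_probability(root, sequence) / len(sequence) < threshold
```

In this sketch, sequences that the training tree compresses well receive high probability, while sequences with many unseen continuations accumulate escape penalties and fall below the threshold.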

Keywords: NYC taxi data; anomaly detection; botnets; command and control channels; computer security; individual sequences; learning; one-dimensional time series; probability assignment; statistical model; universal compression.

Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figures

Figure 1
A statistical model for the sequence “aabdbbacbbda”. Each node in the tree is represented by the 3-tuple {symbol, counter, probability}. The probabilities of edges connected directly to the root equal the corresponding root child’s counter divided by the total number of leaf nodes, i, at each step of the algorithm.
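As a rough illustration of Figure 1, the snippet below (reusing the hypothetical train() helper from the sketch above) rebuilds the parse tree for the caption's example string and prints the root-level 3-tuples. Note that it normalizes by the sum of the root children's counters, whereas the caption divides by the number of leaf nodes, i.

```python
# Illustrative only: parse the caption's example string and print the
# {symbol, counter, probability} 3-tuple for each child of the root.
root = train("aabdbbacbbda")
total = sum(child.count for child in root.children.values())
for sym, child in sorted(root.children.items()):
    print({"symbol": sym, "counter": child.count,
           "probability": child.count / total})
```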
Figure 2
A classification model based on the LZ78 universal compression algorithm and its associated probability assignment.
Figure 3
Testing ‘Majority Vote Classification’ using the TD feature and the ‘Uniform’ quantization method, considering ‘Clients’-type flows. (Left): Receiver Operating Characteristic (ROC) curve; (Right): zoom-in on the upper-left corner. Each testing sequence is first partitioned into several sets of subsequences, denoted as #Subseq in the graph, and the decision is made per set of subsequences; better results were achieved for a larger number of subsequences.
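A hedged sketch of the majority-vote rule described in Figure 3, assuming the log_probability() helper from the earlier sketch; subseq_len and threshold are illustrative parameters, and the paper's exact partitioning may differ.

```python
def majority_vote(root, sequence, subseq_len, threshold):
    """Anomalous if most fixed-length subsequences score below the threshold."""
    subseqs = [sequence[i:i + subseq_len]
               for i in range(0, len(sequence) - subseq_len + 1, subseq_len)]
    votes = sum(1 for s in subseqs
                if log_probability(root, s) / len(s) < threshold)
    return votes > len(subseqs) / 2   # the subsequences vote on the label
```

Voting over several short subsequences smooths out local noise, which is consistent with the caption's observation that more subsequences per set improved the results.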
Figure 4
(Left) Testing the training modes: Semisupervised-Negative, Semisupervised-Positive, and Unsupervised, using the TD feature and the ‘Uniform’ quantization method with respect to ‘Hosts’-type flows. The classifier achieves its best results, AUC = 0.998 with 100% detection and 3.51641% false alarms, in the Semisupervised-Negative training mode, and its worst results, AUC = 0.219 with ∼98% false alarms at 100% detection, in the Semisupervised-Positive training mode. (Right) The effect of the number of quantization levels on performance. QL refers to the number of centroids used. QL = 10 achieved the best results in terms of the area under the curve, with AUC = 0.992, as depicted by the blue line in the figure.
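Since LZ78 parsing needs a finite alphabet, continuous time-difference values must first be quantized. Below is a minimal sketch of the ‘Uniform’ variant, assuming QL equal-width bins over the observed range; the centroid-based variant in the figure would use clustering (e.g., k-means) instead, and all names are illustrative.

```python
def uniform_quantize(values, ql):
    """Map continuous values to the symbols 0..ql-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / ql or 1.0        # guard against a constant sequence
    return [min(int((v - lo) / width), ql - 1) for v in values]
```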
Figure 5
A simple example of time differences (TDi) for both normal and anomalous sequences. In this simple example, the normal traffic is characterized by variable values, which reflect standard network traffic, for example, a user surfing the web. The anomalous data are characterized by fixed values. This reflects the behaviour of a simple C&C channel, where the bots connect at specific times, for a specific time frame. For more complex anomalous behaviour, see also Section 6.
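The single feature here is the inter-arrival time difference, TD_i = t_{i+1} - t_i, computed from packet timestamps, as in this one-line sketch:

```python
def time_differences(timestamps):
    """TD_i = t_{i+1} - t_i for consecutive packet timestamps."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
```

A near-constant TD stream is exactly the signature of the naive periodic C&C beaconing described in the caption.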
Figure 6
Threshold analysis. (Left) Probabilities of training sequences and the thresholds obtained from the testing phase, referred to as Tr (in blue and orange). The x-axis refers to sequence numbers, while the y-axis depicts the probability on a log10 scale. The threshold levels decided with labeled data lie between the μ+σ and μ+2σ levels (light blue and green), where μ and σ are the mean and standard deviation of the probabilities from training alone. (Right) Histogram of the differences between the probabilities of training sequences and the threshold obtained from the testing phase (marked by the red line at 0). A false alarm rate of 5.101% is obtained using this threshold.
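A minimal sketch of the threshold rule reported in Figure 6: compute per-sequence log10 probabilities on training data alone and place the threshold between the μ+σ and μ+2σ levels. The tuning factor k is an assumed knob, not a value from the paper.

```python
import statistics

def pick_threshold(training_log10_probs, k=1.5):
    """Threshold at mu + k*sigma of the training log10-probabilities (assumed rule)."""
    mu = statistics.mean(training_log10_probs)
    sigma = statistics.stdev(training_log10_probs)
    return mu + k * sigma
```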
Figure 7
KL distances between the learned histogram of normal behaviour of firefox.exe and the histograms created every two minutes in the testing phase of the same process, as a function of time. The two gray vertical lines mark the time when “Zeus” was active.
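A minimal sketch of the comparison behind Figure 7, assuming system-call histograms represented as dicts mapping an event to its probability; the epsilon smoothing for unseen events is an assumption.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) for two histograms given as dicts: event -> probability."""
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
               for k in keys)
```

Rebuilding the second histogram every two minutes from fresh observations and tracking the divergence over time yields a curve like Figure 7, with spikes in the window where “Zeus” was active.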
Figure 8
Anomaly detection results for the benchmark file nyc_taxi.csv. Blue bars show the number of taxi passengers every 30 min throughout a 6-month period. The data include five known-cause anomalies: the NYC marathon (2 November 2014), Thanksgiving (27 November 2014), Christmas (25 December 2014), New Year's Day (1 January 2015), and a strong New England blizzard (27 January 2015). All five were correctly identified (orange), together with four false alarms (gray).
