Talking about the relationship between cyber security and machine learning, we need to first identify a concept change. In the past, cyber security focuses on blocking the intruders from outside of our network, but today, we have to believe that intruders are among us. They have invaded our systems and they are doing or going to do damages to us. Whatever the compromised device or machine is doing, it’s acting abnormally. So, cyber security means anomaly detection. Learning about what the machines are normally behaving, we can identify the unusual behaviors, thus find the intruders and terminate them.
First, let’s take a look at the different cyber attacks. The major types of cyber attack that could use some help from machine learning includes:
Malware – they are software installed from attachment in phishing emails, or from web sites with malicious links. Natural language processing can definitely help analyzing the content of text being distributed within the network, block the content and alert the users. Also, malware usually use resources intensively, so they could be pinpoint down by CPU usage monitoring. Installing Anti-Malware software which maintains a library of common files or malicious IP address will help as well.
Zero day attack – Hackers attack computers using vulnerability of software that is unknown to public. To reduce the impact, Patches need to be installed as soon as they are available, and unpatched or newly patched machines must be scanned more frequently. Understanding the nature of the vulnerability, and using that in the feature engineering process makes machine learning more efficient.
APT – Advanced Persistent Thread is the worst of all cyber attacks. Intruders do not make any immediate damages after they compromised a machine, instead, they hide in the network and slowly steal data, affect more machines, and wait for a perfect time to launch attacks. Without analytics, detecting APT is almost impossible.
Academically, using machine learning and deep learning algorithms for anomaly detection has started in the late 1990’s, when, at the time, we don’t even have the term deep learning. It’s neural networks, and we will use the two terms interchangeably in this blog.
So how do we detect anomaly? Before answering this question, we have to define normality. Every machine or device has a regular behavior, which can be analyzed and described using logs or events collected from all the machines. Any activities or sequence of activities that is different from the normal behavior may be an anomaly. We can define the following three kinds of anomalies:
- Point Anomaly – in terms of machine activity, it could mean access to restricted systems, or any summation of behaviors statistically reaching predefined thresholds.
- Contextual Anomaly – Unlike the point anomaly, contextual anomaly may look normal by itself. It’s only by comparing parameters within a timeframe can we find the irregularity.
- Collective Anomaly – For this one, we need to look at a longer timespan and find out a collection of behaviors that doesn’t look normal.
We mentioned feature engineering earlier. Feature engineering means using the domain knowledge of the data to create features that will be used in machine learning. In the domain of cyber security, common features used are:
- CPU usage
- Login time
- All Systems accessed
- File directory
- Amount of data transferred in and out
- Application logs
- Sys logs
- Database logs
Last but not least, let’s review machine learning algorithms for detecting anomalies. According to Chandola, Banerjee, and Kumar in their 2009 Anomaly Detection: A Survey, there are following 6 anomaly detection techniques.
classification technique creates classifiers(models) through the training of labeled data, and then classify the test instances through the learnt models. It could be a single-class classification, where the whole training set has only one normal class. It can also be a multi-class classification, where the training set has more than one normal class.
Common algorithms includes Rule based algorithm, Naive bayesian, Support Vector Machines and Neural Networks. Application of classification based techniques on test instances can be fast and accurate, but it relies heavily on availability of accurate labels for the classes.
Nearest Neighbor based
Nearest Neighbor based anomaly detection techniques assumes normal data occurs in dense neighborhood. If defines a distance between two data instances based on their similarity, and then either 1) use the distance of a data instance to its kth nearest neighbor, or 2) compute the relative density of each data instance to compute an anomaly score.
Nearest neighbor is an unsupervised technique, which, if appropriate distance measure for the given data is defined, is pure data driven.
Clustering algorithm is another unsupervised technique, which group similar data instances into clusters. The instance that is outside of any clusters is an anomaly. Similar to nearest neighbor technique, Clustering also require computation of distance between instances. The main difference is that the purpose of clustering is to find the absolute position of the center of the clusters, while nearest neighbor uses the relative position of each data instance.
Clustering algorithm includes K-means, Self-organizing maps, or Expectation Maximization. Some argue that clustering algorithm looks for similarity to identify clusters, and anomaly detection is just a by-product from unoptimized techniques.
Statistical techniques use a statistical model built on historical data with normal behavior and then apply a statistical inference test to determine if an testing instance belongs to this model or not. If it does not belong, it’s anomaly.
Statistical techniques can be parametric, such as Gaussian or Regression model, or it can be non-parametric, such as Histogram based. The key to statistical techniques is the assumption that data is generated from a particular distribution. If it’s true, it’s a statistically justifiable solution. Unfortunate, it is not always the case, especially for high dimensional real data sets.
If anomalies in data induce irregularities in the information content of the data set, we can use information theoretic techniques. Information theoretic techniques can be described as: given a data set D, let C(D) denote the complexity, find the minimal subset of instances, I, such that C(D)−C(D−I) is maximum. All instances in the subset thus obtained, are deemed as anomalous. Common algorithms inlude Kolomogorov complexity, entropy and relative entropy.
Spectral techniques can be used if data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different. Such technique will find the subspace and identify the anomaly. A common method is Principal Component Analysis (PCA), which projects data into a lower dimensional space, and an instance of the data that deviates from the correlation structure is an anomaly.
Spectral techniques usually reduces the dimensions of data and have high computational complexity.
All the above techniques are dealing with point anomaly. For contextual and collective anomaly, it is a common practice to transform the sequences to a finite feature space and then use a point anomaly detection technique in the new space to detect anomalies.
All the anomaly we have discussed so far focus on the machine. Another approach focus on the users of the system, and it tracks, collects and assessing data regarding user activities. Analytical methods that focus on user behaviors are called User Behavior Analytics.
User Behavior Analytics (UBA)
User Behavior Analytics analyzes events collected and performs behavior modeling, peer group analytics, graph mining, and other techniques to find hidden threats by identifying anomalies and stitching them together to form actionable threat patters, for example:
- Privileged account abuse
- Suspicious login
- Data exfiltration
- Virtual machine/container breach
- Unusual SaaS and remote user behavior
- Rogue mobile device transmitting malware
- Data theft from privileged app infiltration
- Malware command and control (CnC)
- Cloud compromise
- System malware infection.
UBA utilizes the same machine anomaly detection algorithm. Security tools equipped with machine learning is moving from providing insights to security operators to taking defensive actions to threads, slowly but surely. Will machine learning replace cyber security experts one day? Probably not, because it is inevitable that human and machines will always be allies on both sides of the cyber war.