Advanced Analytics

Cyber Security and Machine Learning

To talk about the relationship between cyber security and machine learning, we first need to recognize a shift in mindset. In the past, cyber security focused on blocking intruders from outside our network; today, we have to assume that intruders are already among us. They have invaded our systems and are doing, or are about to do, damage. Whatever a compromised device or machine is doing, it is acting abnormally. In that sense, cyber security becomes anomaly detection: by learning how machines normally behave, we can identify unusual behaviors, find the intruders, and shut them down.

First, let's take a look at the different cyber attacks. The major types of cyber attack that could benefit from machine learning include:

Malware – software installed from attachments in phishing emails or from websites with malicious links. Natural language processing can help analyze the text content being distributed within the network, block it, and alert the users. Malware also tends to use resources intensively, so it can be pinpointed through CPU usage monitoring. Installing anti-malware software that maintains a library of known malicious files or IP addresses helps as well.

Zero-day attack – hackers exploit a software vulnerability that is not yet publicly known. To reduce the impact, patches need to be installed as soon as they are available, and unpatched or newly patched machines must be scanned more frequently. Understanding the nature of the vulnerability, and using that knowledge in the feature engineering process, makes machine learning more effective.

APT – Advanced Persistent Threat is the worst of all cyber attacks. Intruders do not do any immediate damage after they compromise a machine; instead, they hide in the network, slowly steal data, infect more machines, and wait for the perfect time to launch an attack. Without analytics, detecting an APT is almost impossible.

Academically, the use of machine learning and deep learning algorithms for anomaly detection started in the late 1990s, when the term deep learning did not even exist yet. It was simply neural networks, and we will use the two terms interchangeably in this blog.

So how do we detect anomalies? Before answering this question, we have to define normality. Every machine or device has a regular behavior, which can be analyzed and described using the logs or events collected from all the machines. Any activity or sequence of activities that differs from this normal behavior may be an anomaly. We can define the following three kinds of anomalies (a short sketch contrasting the first two follows the list):

  • Point Anomaly – in terms of machine activity, this could be access to a restricted system, or any aggregation of behaviors that statistically exceeds a predefined threshold.
  • Contextual Anomaly – unlike a point anomaly, a contextual anomaly may look normal by itself; only by comparing parameters within a timeframe can we find the irregularity.
  • Collective Anomaly – for this one, we need to look at a longer timespan and find a collection of behaviors that does not look normal as a whole.
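
To make the distinction concrete, here is a minimal Python sketch on synthetic hourly login counts. The data, the 3-sigma global cut-off, and the 5-sigma per-hour cut-off are all illustrative assumptions, not part of any particular tool: a point anomaly is extreme on any scale, while a contextual anomaly only stands out once we compare it against the same hour of day.

import numpy as np

rng = np.random.default_rng(0)
days, hours = 60, 24
# Synthetic hourly login counts: quiet at night, busy during office hours (9-17)
baseline = np.where((np.arange(hours) >= 9) & (np.arange(hours) <= 17), 50, 2)
counts = rng.poisson(baseline, size=(days, hours)).astype(float)
counts[3, 14] = 500   # point anomaly: extreme on any scale
counts[5, 3] = 45     # contextual anomaly: fine at 2 pm, very odd at 3 am

# Point anomaly: a single global threshold over all observations (arbitrary cut-off)
global_threshold = counts.mean() + 3 * counts.std()
print("point anomalies (day, hour):", np.argwhere(counts > global_threshold))

# Contextual anomaly: compare each value against the history of that same hour of day
hour_mean = counts.mean(axis=0)
hour_std = counts.std(axis=0) + 1e-9
z = (counts - hour_mean) / hour_std
print("contextual anomalies (day, hour):", np.argwhere(np.abs(z) > 5))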

We mentioned feature engineering earlier. Feature engineering means using domain knowledge of the data to create the features that will be used in machine learning. In the domain of cyber security, commonly used features include (a small aggregation sketch follows the list):

  • CPU usage
  • Login time
  • All Systems accessed
  • File directory
  • Amount of data transferred in and out
  • Application logs
  • Sys logs
  • Database logs
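
As a hedged illustration, the sketch below aggregates a handful of hypothetical log records into one feature vector per host: CPU statistics, total bytes transferred out, the number of distinct systems accessed, and off-hours activity. The column names, values, and the definition of "off hours" are made up for the example; in practice they would come from your own log schema.

import pandas as pd

# Hypothetical raw log records collected from two hosts
logs = pd.DataFrame({
    "host": ["web01", "web01", "db01", "db01", "db01"],
    "timestamp": pd.to_datetime([
        "2024-05-01 02:13", "2024-05-01 09:40",
        "2024-05-01 10:05", "2024-05-01 10:07", "2024-05-01 23:55"]),
    "cpu_pct": [12.0, 55.0, 40.0, 42.0, 97.0],
    "bytes_out": [1_000, 20_000, 5_000, 6_000, 900_000],
    "system": ["crm", "crm", "billing", "billing", "hr"],
})

logs["hour"] = logs["timestamp"].dt.hour
features = logs.groupby("host").agg(
    cpu_mean=("cpu_pct", "mean"),
    cpu_max=("cpu_pct", "max"),
    bytes_out_total=("bytes_out", "sum"),
    distinct_systems=("system", "nunique"),
    # "off hours" here is an arbitrary choice: before 6 am or after 8 pm
    offhours_events=("hour", lambda h: ((h < 6) | (h > 20)).sum()),
)
print(features)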

Last but not least, let's review the machine learning algorithms for detecting anomalies. According to Chandola, Banerjee, and Kumar in their 2009 survey, Anomaly Detection: A Survey, there are six families of anomaly detection techniques.

Classification based 
Classification techniques create classifiers (models) by training on labeled data, and then classify test instances using the learnt models. This can be one-class classification, where the whole training set contains only one normal class, or multi-class classification, where the training set contains more than one normal class.

Common algorithms include rule-based systems, naive Bayes, support vector machines, and neural networks. Applying classification-based techniques to test instances can be fast and accurate, but it relies heavily on the availability of accurate labels for the classes.
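
One way to sketch the single-class case is with scikit-learn's OneClassSVM, trained only on instances labeled as normal. The synthetic feature values and the nu parameter are arbitrary assumptions for the illustration; any of the other classifiers named above could be substituted when labeled multi-class data is available.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# Training data: only the "normal" class, e.g. (CPU %, GB transferred)
normal_train = rng.normal(loc=[50, 10], scale=[5, 2], size=(500, 2))
test = np.array([[52, 11],    # looks like the training data
                 [95, 40]])   # far from anything seen during training

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)
print(clf.predict(test))      # +1 = classified normal, -1 = classified anomalous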

Nearest Neighbor based 
Nearest neighbor based anomaly detection techniques assume that normal data occurs in dense neighborhoods. They define a distance between two data instances based on their similarity, and then either 1) use the distance of a data instance to its kth nearest neighbor, or 2) compute the relative density around each data instance, to produce an anomaly score.

Nearest neighbor is an unsupervised technique which, as long as an appropriate distance measure is defined for the given data, is purely data driven.
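
Here is a small sketch of variant 1), the kth nearest neighbor distance as an anomaly score, on synthetic two-dimensional data with scikit-learn. The choice of k = 5 and Euclidean distance are assumptions made for the illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, size=(300, 2)),   # dense normal neighborhood
                  [[8.0, 8.0]]])                     # one isolated instance

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(data)   # +1 because each point is its own neighbor
distances, _ = nn.kneighbors(data)
scores = distances[:, k]                             # distance to the k-th true neighbor
print("most anomalous index:", int(scores.argmax()), "score:", float(scores.max()))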

Clustering based
Clustering is another unsupervised technique, which groups similar data instances into clusters. An instance that falls outside all clusters is an anomaly. Similar to the nearest neighbor technique, clustering also requires computing distances between instances. The main difference is that clustering evaluates each instance relative to the cluster centres it finds, while nearest neighbor evaluates each instance relative to its local neighborhood.

Clustering algorithms include K-means, self-organizing maps, and expectation maximization. Some argue that clustering looks for similarity in order to identify clusters, so anomaly detection is merely a by-product and the technique is not optimized for it.
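
A minimal clustering-based sketch: fit K-means, then score every instance by its distance to the nearest cluster centre and flag the largest distances. The two synthetic clusters, the choice of two centroids, and the 99th-percentile cut-off are all arbitrary assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 1, size=(200, 2)),
                  rng.normal(10, 1, size=(200, 2)),
                  [[5.0, 5.0]]])                      # sits between both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
dist_to_centre = np.min(km.transform(data), axis=1)   # transform() gives distance to each centroid
threshold = np.percentile(dist_to_centre, 99)         # arbitrary cut-off for the sketch
print("flagged indices:", np.where(dist_to_centre > threshold)[0])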

Statistical
Statistical techniques build a statistical model on historical data with normal behavior and then apply a statistical inference test to determine whether a test instance belongs to this model. If it does not, it is an anomaly.

Statistical techniques can be parametric, such as Gaussian or regression models, or non-parametric, such as histogram based. The key assumption behind statistical techniques is that the data is generated from a particular distribution. When that holds, the result is a statistically justifiable solution. Unfortunately, that is not always the case, especially for high-dimensional real data sets.
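
A minimal parametric sketch, assuming the normal behavior really is Gaussian: fit a mean and standard deviation on historical data, then flag test instances that lie too far from the fitted model. The data and the 3-sigma cut-off are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
history = rng.normal(loc=100, scale=15, size=10_000)   # historical normal-behaviour observations
mu, sigma = history.mean(), history.std()

test_instances = np.array([105.0, 180.0, 20.0])
z = np.abs(test_instances - mu) / sigma
print(list(zip(test_instances, z > 3)))                # True = does not fit the fitted model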

Information Theoretic
If anomalies in the data induce irregularities in the information content of the data set, we can use information theoretic techniques. They can be described as follows: given a data set D, let C(D) denote its complexity; find the minimal subset of instances I such that C(D) − C(D − I) is maximal. All instances in the subset obtained this way are deemed anomalous. Common measures of complexity include Kolmogorov complexity, entropy, and relative entropy.
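
The exact formulation is a combinatorial search, so the sketch below only illustrates the idea greedily: using the Shannon entropy of a categorical event log as C(D), it repeatedly removes the single record whose removal shrinks the entropy the most. The event names and the limit of two removals are assumptions for the example.

from collections import Counter
import math

def entropy(records):
    counts = Counter(records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

data = ["login_ok"] * 95 + ["login_fail"] * 4 + ["disable_audit_log"]
removed = []
for _ in range(2):                                     # remove at most 2 records for the sketch
    base = entropy(data)
    gains = [base - entropy(data[:i] + data[i + 1:]) for i in range(len(data))]
    best = max(range(len(data)), key=lambda i: gains[i])
    removed.append(data.pop(best))                     # drop the record that reduces entropy most
print("flagged as anomalous:", removed)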

Spectral
Spectral techniques can be used when the data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different. Such a technique finds that subspace and identifies the anomalies in it. A common method is Principal Component Analysis (PCA), which projects the data into a lower dimensional space; an instance that deviates from the correlation structure of the data is an anomaly.

Spectral techniques automatically reduce the dimensionality of the data, but they usually have high computational complexity.
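
A small PCA sketch of this idea: most of the synthetic data lies close to a line (y ≈ 2x), so projecting onto one principal component and measuring the reconstruction error exposes the one instance that breaks the correlation structure. The data and the single-component choice are assumptions for the example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
x = rng.normal(0, 1, size=500)
data = np.column_stack([x, 2 * x + rng.normal(0, 0.1, size=500)])   # normal pattern: y ≈ 2x
data = np.vstack([data, [[1.0, -2.0]]])                             # breaks the correlation structure

pca = PCA(n_components=1).fit(data)
reconstructed = pca.inverse_transform(pca.transform(data))
errors = np.linalg.norm(data - reconstructed, axis=1)               # reconstruction error as score
print("most anomalous index:", int(errors.argmax()))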

All of the above techniques deal with point anomalies. For contextual and collective anomalies, a common practice is to transform the sequences into a finite feature space and then apply a point anomaly detection technique in the new space.
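
For example, here is a hedged sketch of that reduction: slide a fixed-length window over an encoded event sequence so that each window becomes one point in a small feature space, then reuse a point detector (here, a nearest-neighbor distance) on the windows. The toy event IDs and the window length are assumptions.

import numpy as np
from sklearn.neighbors import NearestNeighbors

events = np.array([1, 2, 1, 2, 1, 2, 1, 2, 9, 9, 9, 1, 2, 1, 2])   # toy encoded event IDs
w = 3
windows = np.array([events[i:i + w] for i in range(len(events) - w + 1)])

nn = NearestNeighbors(n_neighbors=2).fit(windows)
scores = nn.kneighbors(windows)[0][:, 1]          # distance to the nearest other window
best = int(scores.argmax())
print("most anomalous window:", windows[best], "starting at position", best)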

All the anomalies we have discussed so far focus on the machine. Another approach focuses on the users of the system: it tracks, collects, and assesses data about user activities. Analytical methods that focus on user behavior are called User Behavior Analytics.


User Behavior Analytics (UBA)

User Behavior Analytics analyzes the collected events and performs behavior modeling, peer group analytics, graph mining, and other techniques to find hidden threats, identifying anomalies and stitching them together into actionable threat patterns, for example (a peer group sketch follows this list):

  • Privileged account abuse
  • Suspicious login
  • Data exfiltration
  • Virtual machine/container breach
  • Unusual SaaS and remote user behavior
  • Rogue mobile device transmitting malware
  • Data theft from privileged app infiltration
  • Malware command and control (CnC)
  • Cloud compromise
  • System malware infection.
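
As one hedged example of peer group analytics, the sketch below compares each user's off-hours upload volume against the other members of their department using a leave-one-out z-score. The data, column names, and 3-sigma cut-off are all illustrative assumptions.

import pandas as pd

activity = pd.DataFrame({
    "user": ["alice", "bob", "carol", "dave", "erin", "frank"],
    "department": ["eng", "eng", "eng", "eng", "sales", "sales"],
    "gb_uploaded_offhours": [0.2, 0.3, 9.5, 0.25, 0.1, 0.2],
})

def peer_zscore(row, df):
    # Compare the user against everyone else in the same department (leave-one-out)
    peers = df[(df["department"] == row["department"]) & (df["user"] != row["user"])]
    mu, sd = peers["gb_uploaded_offhours"].mean(), peers["gb_uploaded_offhours"].std()
    return (row["gb_uploaded_offhours"] - mu) / sd if sd > 0 else 0.0

activity["peer_z"] = activity.apply(lambda r: peer_zscore(r, activity), axis=1)
print(activity.loc[activity["peer_z"] > 3, ["user", "gb_uploaded_offhours", "peer_z"]])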

UBA utilizes the same anomaly detection algorithms described above for machines. Slowly but surely, security tools equipped with machine learning are moving from providing insights to security operators to taking defensive actions against threats. Will machine learning replace cyber security experts one day? Probably not, because humans and machines will inevitably remain allies on both sides of the cyber war.

Advanced Analytics Reference Architecture

 

Building data platforms and delivering advanced analytical services in the new age of data intelligence can be a daunting task, and the sheer number of tools and methodologies available does not make it any easier. Therefore, a reference architecture is needed to provide guidelines for the process design and best practices for advanced analytics, so that we not only meet the business requirements, but also bring more value to the business.

1. Architectural Guidance

  • The architecture should cover all the building blocks, including the following: Data Infrastructure, Data Engineering, Traditional Business Intelligence, and Advanced Analytics. Within Advanced Analytics, we should include machine learning, deep learning, data science, predictive analytics, and the operationalization of models.
  • One of the first steps should be identifying the gaps between the current infrastructure, tools, and technologies and the end-state environment.
  • We need to create a unified approach to both structured and unstructured data. It is perfectly fine to maintain two different environments for structured and unstructured data, although the two systems will look more and more alike over time.
  • Rome was not built in a day. We first need to build a road map, with the budget in mind, for how the organization can get to the end state, and adapt and/or pivot whenever needed along the way.

2. Best Practices

  • There is never one best solution for everything; each scenario has its own best approach. However, we can create standard approaches for different categories. Creating best practices for different categories or industries and offering them as options is, in itself, a best practice.
  • Things to consider when suggesting a best practice include company size, current infrastructure, and the skill sets of existing IT personnel.

3. Framework for Solutions

  • A reference architecture for Advanced Analytics is depicted in the following diagram. At the bottom of the picture are the data sources, divided into structured and unstructured categories. Structured data is mostly operational data from ERP, CRM, accounting, and any other systems that create the transactions of the business. It is handled by relational database management systems (RDBMS) such as Oracle, Teradata, and MS SQL Server. An RDBMS can serve as the backend for the applications that produce these transactions, in which case it is called an OLTP (online transaction processing) system. Periodically, the transactional data is copied over to data stores for analytical and reporting purposes. These data stores are also built on RDBMS and are called OLAP (online analytical processing) systems. On top of the data warehouse sit business intelligence and data visualization, and we have quite a few powerful tools to support this capability.
  • On the right side, where unstructured data is processed, is the big data world. Just as with structured data, there is a variety of tools we can use to ETL (Extract, Transform, Load) the data into the selected data platforms, which include Hadoop, NoSQL, and cloud based storage systems. Data is ingested into these filesystem based data stores and is then processed by multiple analytical tools. The analytical results are either fed into data visualization tools or operationalized through APIs built with a range of technologies.
  • Demand for stream processing is also growing tremendously, which requires real-time or near real-time analytics of vast amounts of data to identify trends, find anomalies, and predict outcomes. A few tools that can be used in this category are recommended below.

[Diagram: Advanced Analytics reference architecture]

4. The tools I recommend for the various data processing purposes are listed below (so they are search-engine-friendly, even though they all appear in the picture above):

  • Data Ingestion (Paxata, Pentaho, Talend, Informatica…)
  • Data Storage (Cloudera, Hortonworks, MapR Hadoop, Cassandra, HBase, MongoDB, S3, Google Cloud Platform…)
  • Data Analytics (Python, R, H2O (ML and DL), TensorFlow (GPU optional), Databricks/Spark, Caffe (GPU) and Torch (GPU) for deep learning on images and sound)
  • Data Visualization (Tableau, QlikView, Cisco DV…)