Cyber Security and Machine Learning

Talking about the relationship between cyber security and machine learning, we need to first identify a concept change. In the past, cyber security focuses on blocking the intruders from outside of our network, but today, we have to believe that intruders are among us. They have invaded our systems and they are doing or going to do damages to us. Whatever the compromised device or machine is doing, it’s acting abnormally. So, cyber security means anomaly detection. Learning about what the machines are normally behaving, we can identify the unusual behaviors, thus find the intruders and terminate them.

First, let’s take a look at the different cyber attacks. The major types of cyber attack that could use some help from machine learning includes:

Malware – they are software installed from attachment in phishing emails, or from web sites with malicious links. Natural language processing can definitely help analyzing the content of text being distributed within the network, block the content and alert the users. Also, malware usually use resources intensively, so they could be pinpoint down by CPU usage monitoring. Installing Anti-Malware software which maintains a library of common files or malicious IP address will help as well.

Zero day attack – Hackers attack computers using vulnerability of software that is unknown to public. To reduce the impact, Patches need to be installed as soon as they are available, and unpatched or newly patched machines must be scanned more frequently. Understanding the nature of the vulnerability, and using that in the feature engineering process makes machine learning more efficient.

APT – Advanced Persistent Thread is the worst of all cyber attacks. Intruders do not make any immediate damages after they compromised a machine, instead, they hide in the network and slowly steal data, affect more machines, and wait for a perfect time to launch attacks. Without analytics, detecting APT is almost impossible.

Academically, using machine learning and deep learning algorithms for anomaly detection has started in the late 1990’s, when, at the time, we don’t even have the term deep learning. It’s neural networks, and we will use the two terms interchangeably in this blog.

So how do we detect anomaly? Before answering this question, we have to define normality. Every machine or device has a regular behavior, which can be analyzed and described using logs or events collected from all the machines. Any activities or sequence of activities that is different from the normal behavior may be an anomaly. We can define the following three kinds of anomalies:

  • Point Anomaly – in terms of machine activity, it could mean access to restricted systems, or any summation of behaviors statistically reaching predefined thresholds.
  • Contextual Anomaly – Unlike the point anomaly, contextual anomaly may look normal by itself. It’s only by comparing parameters within a timeframe can we find the irregularity.
  • Collective Anomaly – For this one, we need to look at a longer timespan and find out a collection of behaviors that doesn’t look normal.

We mentioned feature engineering earlier. Feature engineering means using the domain knowledge of the data to create features that will be used in machine learning. In the domain of cyber security, common features used are:

  • CPU usage
  • Login time
  • All Systems accessed
  • File directory
  • Amount of data transferred in and out
  • Application logs
  • Sys logs
  • Database logs

Last but not least, let’s review machine learning algorithms for detecting anomalies. According to Chandola, Banerjee, and Kumar in their 2009 Anomaly Detection: A Survey, there are following 6 anomaly detection techniques.

Classification based 
classification technique creates classifiers(models) through the training of labeled data, and then classify the test instances through the learnt models. It could be a single-class classification, where the whole training set has only one normal class. It can also be a multi-class classification, where the training set has more than one normal class.

Common algorithms includes Rule based algorithm, Naive bayesian, Support Vector Machines and Neural Networks. Application of classification based techniques on test instances can be fast and accurate, but it relies heavily on availability of accurate labels for the classes.

Nearest Neighbor based 
Nearest Neighbor based anomaly detection techniques assumes normal data occurs in dense neighborhood. If defines a distance between two data instances based on their similarity, and then either 1) use the distance of a data instance to its kth nearest neighbor, or 2) compute the relative density of each data instance to compute an anomaly score.

Nearest neighbor is an unsupervised technique, which, if appropriate distance measure for the given data is defined, is pure data driven.

Clustering based
Clustering algorithm is another unsupervised technique, which group similar data instances into clusters. The instance that is outside of any clusters is an anomaly. Similar to nearest neighbor technique, Clustering also require computation of distance between instances. The main difference is that the purpose of clustering is to find the absolute position of the center of the clusters, while nearest neighbor uses the relative position of each data instance.

Clustering algorithm includes K-means, Self-organizing maps, or Expectation Maximization. Some argue that clustering algorithm looks for similarity to identify clusters, and anomaly detection is just a by-product from unoptimized techniques.

Statistical techniques use a statistical model built on historical data with normal behavior and then apply a statistical inference test to determine if an testing instance belongs to this model or not. If it does not belong, it’s anomaly.

Statistical techniques can be parametric, such as Gaussian or Regression model, or it can be non-parametric, such as Histogram based. The key to statistical techniques is the assumption that data is generated from a particular distribution. If it’s true, it’s a statistically justifiable solution. Unfortunate, it is not always the case, especially for high dimensional real data sets.

Information Theoretic
If anomalies in data induce irregularities in the information content of the data set, we can use information theoretic techniques. Information theoretic techniques can be described as: given a data set D, let C(D) denote the complexity, find the minimal subset of instances, I, such that C(D)−C(D−I) is maximum. All instances in the subset thus obtained, are deemed as anomalous. Common algorithms inlude Kolomogorov complexity, entropy and relative entropy.

Spectral techniques can be used if data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different. Such technique will find the subspace and identify the anomaly. A common method is Principal Component Analysis (PCA), which projects data into a lower dimensional space, and an instance of the data that deviates from the correlation structure is an anomaly.

Spectral techniques usually reduces the dimensions of data and have high computational complexity.

All the above techniques are dealing with point anomaly. For contextual and collective anomaly, it is a common practice to transform the sequences to a finite feature space and then use a point anomaly detection technique in the new space to detect anomalies.

All the anomaly we have discussed so far focus on the machine. Another approach focus on the users of the system, and it tracks, collects and assessing data regarding user activities. Analytical methods that focus on user behaviors are called User Behavior Analytics.

User Behavior Analytics (UBA)

User Behavior Analytics analyzes events collected and performs behavior modeling, peer group analytics, graph mining, and other techniques to find hidden threats by identifying anomalies and stitching them together to form actionable threat patters, for example:

  • Privileged account abuse
  • Suspicious login
  • Data exfiltration
  • Virtual machine/container breach
  • Unusual SaaS and remote user behavior
  • Rogue mobile device transmitting malware
  • Data theft from privileged app infiltration
  • Malware command and control (CnC)
  • Cloud compromise
  • System malware infection.

UBA utilizes the same machine anomaly detection algorithm. Security tools equipped with machine learning is moving from providing insights to security operators to taking defensive actions to threads, slowly but surely. Will machine learning replace cyber security experts one day? Probably not, because it is inevitable that human and machines will always be allies on both sides of the cyber war.


Big Data and DevOps

When a new IT buzzword is formed, we tend to analyze its relationship with other IT aspects. Today we are going to review the relationship of two IT buzzwords: Big Data and DevOps.

What is DevOps

Dev is from the word “development” and Ops is from the word “operations”, but as you can see in figure 1, there is a QA piece which has not made it into the new name. Here’s the definition of DevOps from Wiki: “It is a culture, movement or practice that emphasizes the collaboration and communication of both software developers and other information-technology (IT) professionals while automating the process of software delivery and infrastructure changes.” Communication and collaboration is the key for DevOps, and a good QA is the key for a good communication of the dev team and the ops team.

devops.pngfigure 1

To better understand what DevOps is, I recommend the following blog from Marc Hornbeek, Principal Practice Architect on DevOps for Trace3, 7 Pillars of DevOps: Essential Practice for Enterprise Success, which highlighted the 7 key aspects of DevOps and how it will lead to a successful enterprise DevOps practice.

  1. Collaborative Culture – Dev, Ops and QA teams need to align goals and create cooperative procedures.
  2. Designed for DevOps – The basic for DevOps design is modular and immutable architecture using micro services.
  3. Continuous Integration – Must have minimum impact to production.
  4. Continuous Testing – Cover all pipelines and avoid bottleneck
  5. Continuous Monitoring – Ensure full coverage of all pipelines to avoid bottlenecks.
  6. Continuous Delivery – It’s a non-stop practice to support ever-changing business.
  7. Elastic Infrastructure – Virtualized environment/Cloud

DevOps7Pillarsfigure 2

The multiple steps in a DevOps process are called pipelines. There are different variations for the name of the pipes, but they are all similar to the ones in figure 2. It covers the key steps of the software development life cycle.

Design – Create – Merge – Build – Bind – Deliver – Deploy

DevOps Trends

Before moving on the big data, I would like to point out a few trends in DevOps:

Agile Methodology – Rapid development of solutions. Agile development is closely related to DevOps. The collaboration and communication between the developers who build and test applications and the IT teams that are responsible for deploying and maintaining IT systems and operations makes it possible for iterations of quick development and deployment.

Virtualized Infrastructure – Providing scalability and elasticity, with shared infrastructure resources that scales up or down as required. Deploying solutions in the cloud is the right directions.

Continuous Deployment – Continuously test solutions and continuously improve. Let me emphasize this one more time. We need to continuously provide upgrades to support the ever-changing business.

Big Data and Consulting

Big Data Consultant should always coordinate with DevOps no matter if the client is building a big data team, establishing big data infrastructure, or working on any big data analytics/development project.

  • Building teams – Skill sets review, talent acquisition, and training.
  • Big data infrastrucutre – Set up infrastructure in cloud or on premise. Install Hadoop, NoSql database, and third party tools and platforms.
  • Big data project – As consultants, we should never keep the ownership of projects or codes. Need to be able to transfer code, knowledge and support to client.

Based on the function of the group being consulted, whether business or IT, the recommendations can be at a strategic level or at an operational level. Either way, the consultant should bridge the gaps between business and IT.

They should also bridge the gaps between IT teams who are gatekeepers of the data and the data scientist and data analyst who need the infrastructure to run analytics.

A key to success for Big Data DevOps is the operationalization of predictive models to achieve continuous Analytics.

DevOps for Big Data


Basically, DevOps for Big Data can be divided into three categories: data infrastructure, data engineering and data analytics.

DevOps for Data Infrastructure – Provisioning data notes, deploying clusters, installing tools and security policies.

Big data technology such as Hadoop and Spark are getting more mature and more popular. Maintaining a group of seasoned developers and architects who understand how the technology is implemented, and has worked with the open source version can help keeping an edge and guiding the team to the right direction.

With virtual infrastructure getting more and more popular, it’s almost a must have to be able to work on elastic cluster provisioning, monitoring and auto scaling in the cloud.

DevOps for Data Engineering – Defining data structure, ETL, creating APIs, and providing data-platform-as-a-service supports data scientist.

We need to consider the following, when planning DevOps for data engineering.

  1. As any other DevOps projects, signed off project plan and design document need to be obtained, so we can have a clear scope of the initiative and provide better estimation. (for big data projects, your client will think really big.)
  2. Is it a truly big data project? Do not achieve the goal of creating an RDBMS in Hadoop. (ROI is way too small.)
  3. It is a common practice to create data warehouse in Hadoop, the money saved in adding more Oracle and Teradata server can be used to set up Big Data infrastructure, but the ultimate goal of introducing a big data environment is to support Data Analytics.
  4. We can recommend products, or help build solutions, but in the end, the client needs to be able to achieve self services.

DevOps for Data Analytics – Building Models, Turning prototypes into operational solutions.

Data Science is also development. Data Scientist needs to write code and test result in order to find a solution, and that solution needs to be operationalized.


We have discussed the relationship of DevOps and Big Data in this blog,  but it is another interesting topic on how big data will change the landscape of DevOps, which we will talk about next time.



Advanced Analytics Reference Architecture


Building data platforms and deliverying advanced analytical services in the new age of data intelligence can be a daunting task. It’s not really helping with all the tools and methodologies that we know we can use. Therefore, a reference architecture is needed to provide guidelines for the process design and best practices for advanced analytics, so we can not only meet the business requirement, but also bring more value to the business.

1. Architectural Guidance

  • The architecture should cover all building blocks including the following: Data Infrastructure, Data Engineering, Traditional Business Intelligence, and Advanced Analytics. Within Advanced Analytics, we should include machine learning, deep learning, data science, predictive analytics, and the operationalization of models.
  • One of the first steps should be finding the gaps between current infrastructure, tools, technologies and the end state environment.
  • We need to create a unified approach to both structured and unstructured data. It’s perfectly fine to maintain two different environments for structured and unstructured data, although both systems will look more and more close to each other.
  • Rome is not built in one night. We need to first build a road map, with budget in mind, on how the organization can get to the end state, adept and/or pivot whenever needed along the way.

2. Best Practices

  • There is never one best solution for all. A different scenario will have its very own best approach. However, we can create standard approaches for different categories. Creating best practices for different categories or industries and make them options, it is by itself a best practice.
  • Things we need to consider when suggesting a best practice includes company size, current infrastructure, skillsets of existing IT personnel.

3. Framework for Solutions.

  • A reference architecture for Advanced Analytics is depicted in the following diagram. On the bottom of the picture are the data sources, divided into structured and unstructured categories. Structured data are mostly operational data from existing ERP, CRM, Accounting, and any other systems that create the transactions for the business. They are handled by relational databases RDBMS such as Oracle, Teradata, and MS SqlServer. The RDBMS can be used as the backend for applications which produce these transactions, and they are called OLTP – online transactional processing system. Periodically the transactional data will be copied over to data stores for analytical and reporting purpose. These data stores are also built on RDBMS, and they are called OLAP – online analytcal processing sytem. On top of data warehouse is business intelligence and data visualization. We have quite a few powerful tools to support this capability.
  • On the right side, where the unstructured data is processed, that’s the big data world. Just as for structured data, there is a variety of tools that we can use for ETL (Extract-Transform-Load) of the data into selected data platforms, which include Hadoop, NoSQL, and all those cloud based storage systems. Data is ingested into these filesystem based data stores, and is then processed by multiple analytical tools. The analytical results are either fed to the data visualization tools, or operationalized by APIs created using all kinds of technology.
  • Demanding for streaming process is also growing tremendously, which requires real-time or near real-time analytics of vast amount of data to identify treads, find anomalies, and predict results. A few tools that can be used in this category is recommended.

Screen Shot 2016-08-23 at 4.31.22 PM.png

4. The tools I recommend for multiple data processing purposes are listed as follows. (So they can be search-engine-friendly, even though they are all listed in the picture above.)

  • Data Ingestion (Paxata, Pentaho, Talend, informatica…)
  • Data Storage (Cloudera, Hortonworks, MapR Hadoop, Cassandra, HBase, MongoDB, S3, Google CloudPlatform…)
  • Data Analytics (Python, R, H2O (ML and DL), TensorFlow (GPU optional), Databricks/Spark, Gaffe (GPU), Torch (GPU) for for deep learning of image and sound)
  • Data Visualization (Tableau, Qlikview, Cisco DV…)

Machine Learning Tools

This is an incomplete list of all machine learning tools currently available as of July 2016. I categorized them into Open Source tools and commercial tools, however, the open source tools usually have a commercialized version with support, and the commercial tools tend to include a free version so you can download and try them out. Click the product links to learn more.

Open Source


Spark MLlib

  • MLlib is Apache Spark’s scalable machine learning library.
    • Initial contribution from AMPLab, UC Berkeley
    • Shipped with Spark since version 0.8
    • Over 30 contributors
    • Includes any common machine learning and statistical algorithms
    • Supports Scala, Java and Python programming languages
  • Pros
    • Powerful processing performance of Spark. (10x faster in memory and 100x faster in hard disk.)
    • Runs on Hadoop, Mesos or Stand online.
    • Easy to code. (with Scala)
  • Cons
    • Spark requires experienced engineers.
  • Online Resources
  • Algorithm
  • –Basic Statistics
    • Summary, Correlation, Sampling, Hypothesis testing, and random data generation.

    –Classification and regression

    • linear regression with L1, L2, and elastic-net regularization
    • logistic regression and linear support vector machine (SVM)
    • Decision tree, naive Bayers, random forest and gradient-boosted trees
    • isotonic regression

    –Collaborative filtering/recommendation

    • alternating least squares (ALS)


    • k-means, bisecting k-means, Gaussian mixtures (GMM),
    • power iteration clustering, and latent Dirichlet allocation (LDA)

    –Dimensionality reduction

    • singular value decomposition (SVD) and QR decomposition
    • principal component analysis (PCA)

    –Frequent pattern mining

    • FP-growth, association rules, and PrefixSpan

    –feature extraction and transformations


    • limited-memory BFGS (L-BFGS)



Scikit-learn is a Python module for machine learning

  • built on top of SciPy
  • Open source, commercially usable – BSD license
  • Started in 2007 as a Google Summer of Code.
  • Built on NumPy, SciPy, and matplotlib


  • Algorithms
    • classification: SVM, nearest neighbors, random forest
    • regression: support vector regression (SVR), ridge regression, Lasso, logistic regression
    • clustering: k-means, spectral clustering, …
    • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), …
    • model selection: grid search, cross validation, metrics
    • preprocessing: preprocessing, feature extraction




  • H2O is open-source software for big-data analysis.
  • Built by a Startup in 2011 in Sillicon Valley.
  • Users can throw models at data to find usable information, allowing H2O to discover patterns.
  • Provides data structures and methods suitable for big data.
  • Works with cloud, hadoop, and all operating systems.
  • Written and supported Java, Python and R.
  • Graphical interface works with all browsers.
  • Website: 


  • pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • Python is good for data munging and preparation. Panda helps with data analysis and modeling.
  • Works great when combined with iPython toolkit.
  • Good for linear and panel regression. Others can be found in scikit-learn.


Google TensorFlow

  • Open source machine learning library developed by Google, and used in a lot of Google products such as google translate, map and gmails.
  • Uses data flow graphs for numeric computation. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
  • Extensive built-in support for deep learning
  • Just another library. Not the trained models or suggested algorithm for google products.
  • Cloud offering – Google Cloud ML

rstudio-ball R

  • R is a free software environment for statistical computing and graphics.
  • Pros
    • Open source and enterprise ready with Rstudio.
    • Huge ecosystem, lots of libraries and packages.
    • Runs on all operating systems, and files of all format.
  • Cons
    • Algorithm implementations varies and results are different.
    • Memory management not good. Performance worsen with more data.
  • Most used R ML Packages
    • e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier
    • rpart Recursive Partitioning and Regression Trees.
    • igraph A collection of network analysis tools.
    • nnet Feed-forward Neural Networks and Multinomial Log-Linear Models.
    • randomForest Breiman and Cutler’s random forests for classification and regression.
    • caret package (short for Classification And Regression Training)
    • glmnet Lasso and elastic-net regularized generalized linear models.
    • ROCR Visualizing the performance of scoring classifiers.
    • gbm Generalized Boosted Regression Models.
    • party A Laboratory for Recursive Partitioning.
    • arules Mining Association Rules and Frequent Itemsets.
    • tree Classification and regression trees.
    • klaR Classification and visualization.
    • RWeka R/Weka interface.
    • ipred Improved Predictors.
    • lars Least Angle Regression, Lasso and Forward Stagewise.
    • earth Multivariate Adaptive Regression Spline Models.
    • CORElearn Classification, regression, feature evaluation and ordinal evaluation.
    • mboost Model-Based Boosting.



  • Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
    • tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
    • transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU.(float32 only)
    • efficient symbolic differentiation – Theano does your derivatives for function with one or many inputs.
    • speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
    • dynamic C code generation – Evaluate expressions faster.
    • extensive unit-testing and self-verification – Detect and diagnose many types of errors.
  • Theano has been powering large-scale computationally intensive scientific investigations since 2007.


  • Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.
  • It is free software licensed under the GNU General Public License.
  • contains a collection of visualization tools and algorithms for data analysis and predictive modeling.
  • Weka’s main user interface is the Explorer.
  • impossible to train models from large datasets using the Weka Explorer graphical user interface.
  • Use command-line interface (CLI) or write Java/Groovy/Jython.
  • Supports some streaming.


ml AWS Machine Learning

  • Provides visualization tools and wizards to create machine learning models.
  • Easy to obtain predictions for the built model using simple APIs.
  • Used by internal data scientist community.
  • Highly scalable, supports real-time process and at high throughput.
  • Cloud based. Pay as you go.

9564126_orig  Azure Machine Learning

  • Provides visualization tools and wizards to create machine learning models.
  • Easy to obtain predictions for the built model using simple APIs.
  • Used by internal data scientist community.
  • Highly scalable, supports real-time process and at high throughput.
  • Cloud based. Pay as you go.

IBM Watson Analytics

  • IBM data analysis solution in the cloud.
  • Automated visualization.
  • Professional version: 10m rows, 500 columns, 100GB storage.
  • Connects to social media data.
  • Supports free form text questions about data (Google Search Box).
  • Supports easy and secure collaboration.

sasviyalogomidnight  SAS Viya

  • Cloud ready analytics and visualization architecture from the leading analytics software company.
  • Can be onsite as well.
  • Supports following SAS platforms
    • SAS Visual Analytics
    • SAS Visual Statistics
    • SAS Visual Investigators (Search)
    • SAS Data Mining and Machine Learning
  • Supports Python, Lua, Java and all REST APIs.
  • Available third quarter of 2016



Advanced Analytics 101

While presenting advanced analytics practice to business executives or sales partners, it is important to let them understand some of the basic concepts. We need to be able to explain in a few sentences what some buzzwords are about. That’s the purpose of this blog.

Machine Learning

While I disagree with the statement that big data is just about machine learning, it shows how important it is, and how widely it is currently used. Author Samuel has given a really good definition to machine learning in 1959 as a “field of study that gives computers the ability to learn without being explicitly programmed.” For years, human beings have been doing a better and better job programming computers to do things for us, we are getting ready to let the machines build algorithms, study processes, and make decisions on their own. It’s made possible because of the maturity of technology to handle large volume of information in a timely fashion. That’s how machine learning becomes the jewel of the big data crown.

Machine learning can be overlapped with statistics. It uses mathematical optimization to build models, analyze data and deliver predictions. Machine learning can be categorized into supervised learning and unsupervised learning. In supervised learning, examples of input and output are presented by human being, the supervisor, and through calculation, the computer will learn the rules that map the inputs to the outputs. In unsupervised learning, no goal is provided and methods are designed for the computer to find out the structure of the data or a means to the ends.

Common problems that can be solved by machine learning are grouped into classification, clustering, regression, density estimation, and dimensionality reduction.

Deep Learning

Deep learning can be easily confused with machine learning. It is actually a branch of machine leanring that learns to represent data in an abstract way. It gets the name by using multi-layer non-linear processing units. The units can be supervised or unsupervised, and each layer uses the output of previous layer as input. The number of layers in deep learning is closely tied to the level of abstraction of the data, since it assumes the observed data are generated by the interactions of factors organized in layers.

Deep learning is actually a rebranding of the old neroscience because it is similar to the way information is communicated and processed in a nervous system, which defines a relationship between various stimuli and associated response in the brain.

A most successful deep learning algorithm is ANN – artificial neuro networks. It has addressed many problems such as image classification, language translation and spam identification.

Artificial Intelligence

The term artificial intelligence (AI) has most history and has a broader meaning. It mimics human minds and builds cognitive functions to learn and to solve problems. AI uses machine learning and deep learning algorithms. We could say the ultimate goal of AI is to build a machine that can think, talk and behave just as human, (such machines have been depicted vividly in countless books and movies,) but today, we have successfully build robots who can chat with us, machines able to beat the best human Chess or Go players, and cars that drive by themselves.

Pattern Recognition

Pattern recognition is a machine learning method to assign labels to input values, therefore, recognize the regularities, or patterns in the data. Pattern recognition aims to give a reasonable explanation of all possible training data, therefore, pattern matching can be applied to find a pattern for all new incoming data. We could also identify anomalies that do not match the recognized patterns.

Feature Engineering

Feature engineering is also a machine learning method to find the characteristics of the data. We can define a lot of attributes to the data, and the ones that can be used for prediction of any sorts are features. Feature engineering is an important part of predictive modeling, and the definition of the features will heavily impact the results of prediction. Feature engineering process involves brainstorming, buiding, repetitive validation, improving of the features and usually involves both data analytics and business users.
If you have read to this point, you are really interested in advanced analytics. Please stay tuned as I will explain machine learning tools including Spark MLlib in detail in my future blogs.

Introduction to ElasticSearch in Scala

I haven’t had time recently to write blogs of my own, but I saw this blog and I think it’s really helpful for those who are interested in doing ElasticSearch, so I’m going to share it on my own blog. This is the first time I’m doing a reblog.


Elasticsearch is a real-time distributed search and analytics engine built on top of Apache Lucene. It is used for full-text search, structured search and analytics.

Lucene is just a library and to leverage its power you need to use Java. Integrating Lucene directly with your application is a very complex task.

Elasticsearch uses the indexing and searching capabilities of Lucene but hides the complexities behind a simple RESTful API.

In this post we will learn to perform basic CRUD operations using Elasticsearch transport client in Scala with sbt as our build-tool.

Let us start by downloading Elasticsearch from here and unzipping it.

Execute the following command to run Elasticsearch in foreground:

Test it out by opening another terminal window and running the following:

To start with the coding part, create a new sbt project and add the following dependency in the build.sbt file.

Next, we need to create a…

View original post 279 more words

Screen Shot 2016-04-27 at 2.13.14 PM

Install Hadoop and Spark on a Mac

Hadoop best performs on a cluster of multiple nodes/servers, however, it can run perfectly on a single machine, even a Mac, so we can use it for development. Also, Spark is a popular tool to process data in Hadoop. The purpose of this blog is to show you the steps to install Hadoop and Spark on a Mac.

Operating System: Mac OSX Yosemite 10.11.3
Hadoop Version 2.7.2
Spark 1.6.1


1. Install Java

Open a terminal window to check what Java version is installed.
$ java -version

If Java is not installed, go to to download and install latest JDK. If Java is installed, use following command in a terminal window to find the java home path
$ /usr/libexec/java_home

Next we need to set JAVA_HOME environment on mac
$ echo export “JAVA_HOME=$(/usr/libexec/java_home)” >> ~/.bash_profile
$ source ~/.bash_profile

2. Enable SSH as Hadoop requires it.

Go to System Preferences -> Sharing -> and check “Remote Login”.

Generate SSH Keys
$ ssh-keygen -t rsa -P “”
$ cat ~/.ssh/ >> ~/.ssh/authorized_keys

Open a terminal window, and make sure we can do this.
$>ssh localhost

Download Hadoop Distribution

Download the latest hadoop distribution (2.7.2 at the time of writing)

Create Hadoop Folder

Open a new terminal window, and go to the download folder, (let’s use “~/Downloads”), and find hadoop-2.7.2.tar

$ cd ~/Downloads
$ tar xzvf hadoop-2.7.2.tar
$ mv hadoop-2.7.2 /usr/local/hadoop

Hadoop Configuration Files

Go to the directory where your hadoop distribution is installed.
$ cd /usr/local/hadoop

Then change the following files

$ vi etc/hadoop/hdfs-site.xml


$ vi etc/hadoop/core-site.xml


$ vi etc/hadoop/yarn-site.xml


$ vi etc/hadoop/mapred-site.xml


Start Hadoop Services

Format HDFS
$ cd /usr/local/hadoop
$ bin/hdfs namenode -format

Start HDFS
$ sbin/

Start YARN
$ sbin/


Check HDFS file Directory
$ bin/hdfs dfs -ls /

If you don’t like to include the bin/ every time you run a hadoop command, you can do the following

$ vi ~/.bash_profile
append this line to the end of the file “export PATH=$PATH:/usr/local/hadoop/bin”
$ source ~/.bash_profile

Now try to add the following two folders in HDFS that is needed for MapReduce job, but this time, don’t include the bin/.

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/{your username}

You can also open a browser and access Hadoop by using the following URL

Next: Spark

Installing Spark is a little easier. You can download the latest Spark here:

It’s a little tricky on choosing which package type. We want to choose “pre-build with user provided Hadoop [can use with most Hadoop distributions]” type, and the downloaded file name is spark-1.6.1-bin-without-hadoop.tgz

After spark is downloaded, we need to untar it. Open a terminal window and do the following:

$ cd ~/Downloads
$ tar xzvf spark-1.6.1-bin-without-hadoop.tgz
$ mv spark-1.6.1-bin-without-hadoop /usr/local/spark

Add spark bin folder to PATH

$ vi ~/.bash_profile
append this line to the end of the file “export PATH=$PATH:/usr/local/spark/bin”
$ source ~/.bash_profile

What about Scala?

Spark is written in Scala, so even though we can use Java to write Spark code, we want to install Scala as well.

Download Scala from here:
Choose the first one to download Scala in binary, and the downloaded file is scala-2.11.8.tar

Untar Scala and move it to a dedicated folder

$ cd ~/Downloads
$ tar xzvf scala-2.11.8.tar
$ mv scala-2.11.8 /usr/local/scala

Add Scala bin folder to PATH

$ vi ~/.bash_profile
append this line to the end of the file “export PATH=$PATH:/usr/local/scala/bin”
$ source ~/.bash_profile

Now you should be able to do the following to access Spark shell for Scala

$ spark-shell

That’s it! Happy coding!