Machine Learning Tools

This is an incomplete list of the machine learning tools available as of July 2016. I have categorized them into open source tools and commercial tools. However, the open source tools usually have a commercialized version with support, and the commercial tools tend to include a free version that you can download and try out. Click the product links to learn more.

Open Source


Spark MLlib

  • MLlib is Apache Spark’s scalable machine learning library.
    • Initial contribution from AMPLab, UC Berkeley
    • Shipped with Spark since version 0.8
    • Over 30 contributors
    • Includes many common machine learning and statistical algorithms
    • Supports Scala, Java and Python programming languages
  • Pros
    • Powerful processing performance of Spark. (Spark claims up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk.)
    • Runs on Hadoop, Mesos, or standalone.
    • Easy to code. (with Scala)
  • Cons
    • Spark requires experienced engineers.
  • Online Resources: http://spark.apache.org/mllib/
  • Algorithms
    –Basic statistics
    • Summary, Correlation, Sampling, Hypothesis testing, and random data generation.

    –Classification and regression

    • linear regression with L1, L2, and elastic-net regularization
    • logistic regression and linear support vector machine (SVM)
    • decision trees, naive Bayes, random forests, and gradient-boosted trees
    • isotonic regression

    –Collaborative filtering/recommendation

    • alternating least squares (ALS)


    –Clustering

    • k-means, bisecting k-means, Gaussian mixtures (GMM),
    • power iteration clustering, and latent Dirichlet allocation (LDA)

    –Dimensionality reduction

    • singular value decomposition (SVD) and QR decomposition
    • principal component analysis (PCA)

    –Frequent pattern mining

    • FP-growth, association rules, and PrefixSpan

    –Feature extraction and transformations

    –Optimization

    • limited-memory BFGS (L-BFGS)



Scikit-learn

  • Scikit-learn is a Python module for machine learning.
  • Open source, commercially usable – BSD license
  • Started in 2007 as a Google Summer of Code project.
  • Built on NumPy, SciPy, and matplotlib

Git: https://github.com/scikit-learn/scikit-learn.git

  • Algorithms
    • classification: SVM, nearest neighbors, random forest, logistic regression
    • regression: support vector regression (SVR), ridge regression, Lasso
    • clustering: k-means, spectral clustering, …
    • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), …
    • model selection: grid search, cross validation, metrics
    • preprocessing: preprocessing, feature extraction




H2O

  • H2O is open-source software for big-data analysis.
  • Built by the startup H2O.ai in 2011 in Silicon Valley.
  • Users can throw models at data to find usable information, allowing H2O to discover patterns.
  • Provides data structures and methods suitable for big data.
  • Works with cloud, Hadoop, and all operating systems.
  • Written in Java, with support for Java, Python, and R.
  • Graphical interface works with all browsers.
  • Website: http://www.h2o.ai 


pandas

  • pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • Python is good for data munging and preparation; pandas helps with data analysis and modeling.
  • Works great when combined with the IPython toolkit.
  • Good for linear and panel regression. Other models can be found in scikit-learn.


Google TensorFlow

  • Open source machine learning library developed by Google and used in many Google products such as Google Translate, Maps, and Gmail.
  • Uses data flow graphs for numeric computation. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
  • Extensive built-in support for deep learning
  • It is only a library. It does not include the trained models or the algorithms used in Google products.
  • Cloud offering – Google Cloud ML
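
To make the data flow graph idea above a bit more concrete, here is a minimal sketch in plain Python. It is not the TensorFlow API; the `Node` class and the `constant`, `add`, and `scale` helpers are names I made up for illustration. Each node holds an operation, and the values passed between nodes play the role of the tensors on the edges.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed into it."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Evaluate the upstream nodes first, then apply this node's operation.
        return self.op(*(node.run() for node in self.inputs))


def constant(value):
    """A node with no inputs that simply produces a fixed value."""
    return Node(lambda: value)


def add(a, b):
    """Element-wise addition of two equal-length lists."""
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], a, b)


def scale(a, factor):
    """Multiply every element by a constant factor."""
    return Node(lambda x: [factor * p for p in x], a)


# Build the graph first; nothing is computed until run() is called.
x = constant([1.0, 2.0, 3.0])
y = constant([10.0, 20.0, 30.0])
graph = scale(add(x, y), 2.0)

result = graph.run()
print(result)
```

The point of the sketch is the separation between describing the computation (building `graph`) and executing it (`run()`), which is the part of the design that lets a real library optimize or distribute the work.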

R

  • R is a free software environment for statistical computing and graphics.
  • Pros
    • Open source and enterprise ready with RStudio.
    • Huge ecosystem, lots of libraries and packages.
    • Runs on all operating systems and works with files of all formats.
  • Cons
    • Algorithm implementations vary between packages, so results can differ.
    • Memory management is not good; performance worsens as the data grows.
  • Most used R ML Packages
    • e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier
    • rpart Recursive Partitioning and Regression Trees.
    • igraph A collection of network analysis tools.
    • nnet Feed-forward Neural Networks and Multinomial Log-Linear Models.
    • randomForest Breiman and Cutler’s random forests for classification and regression.
    • caret Classification And Regression Training.
    • glmnet Lasso and elastic-net regularized generalized linear models.
    • ROCR Visualizing the performance of scoring classifiers.
    • gbm Generalized Boosted Regression Models.
    • party A Laboratory for Recursive Partitioning.
    • arules Mining Association Rules and Frequent Itemsets.
    • tree Classification and regression trees.
    • klaR Classification and visualization.
    • RWeka R/Weka interface.
    • ipred Improved Predictors.
    • lars Least Angle Regression, Lasso and Forward Stagewise.
    • earth Multivariate Adaptive Regression Spline Models.
    • CORElearn Classification, regression, feature evaluation and ordinal evaluation.
    • mboost Model-Based Boosting.



Theano

  • Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
    • tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
    • transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU (float32 only).
    • efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
    • speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
    • dynamic C code generation – Evaluate expressions faster.
    • extensive unit-testing and self-verification – Detect and diagnose many types of errors.
  • Theano has been powering large-scale computationally intensive scientific investigations since 2007.
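
The log(1+x) point above is easy to see with plain Python. This is not Theano code; it only uses the standard `math` module to show the kind of numerical problem that such a stability optimization avoids.

```python
import math

x = 1e-20

# Naive form: 1 + x rounds to exactly 1.0 in double precision,
# so the logarithm comes out as 0.0 and the information in x is lost.
naive = math.log(1 + x)

# Stable form: log1p computes log(1 + x) without ever forming 1 + x.
stable = math.log1p(x)

print(naive, stable)
```

For very small x, log(1+x) is approximately x, which is what the stable form returns.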


Weka

  • Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.
  • It is free software licensed under the GNU General Public License.
  • Contains a collection of visualization tools and algorithms for data analysis and predictive modeling.
  • Weka’s main user interface is the Explorer.
  • It is not practical to train models on large datasets using the Weka Explorer graphical user interface.
  • For large datasets, use the command-line interface (CLI) or write Java/Groovy/Jython code.
  • Supports some streaming.


Commercial


AWS Machine Learning

  • Provides visualization tools and wizards to create machine learning models.
  • Easy to obtain predictions from the built model using simple APIs.
  • Used by internal data scientist community.
  • Highly scalable; supports real-time processing at high throughput.
  • Cloud based. Pay as you go.

Azure Machine Learning

  • Provides visualization tools and wizards to create machine learning models.
  • Easy to obtain predictions from the built model using simple APIs.
  • Used by internal data scientist community.
  • Highly scalable; supports real-time processing at high throughput.
  • Cloud based. Pay as you go.

IBM Watson Analytics

  • IBM data analysis solution in the cloud.
  • Automated visualization.
  • Professional version: 10 million rows, 500 columns, 100 GB storage.
  • Connects to social media data.
  • Supports free-form text questions about data (like a Google search box).
  • Supports easy and secure collaboration.

SAS Viya

  • Cloud-ready analytics and visualization architecture from the leading analytics software company.
  • Can be deployed on site as well.
  • Supports the following SAS products
    • SAS Visual Analytics
    • SAS Visual Statistics
    • SAS Visual Investigator (Search)
    • SAS Data Mining and Machine Learning
  • Supports Python, Lua, Java, and REST APIs.
  • Available in the third quarter of 2016



Advanced Analytics 101

When presenting advanced analytics practice to business executives or sales partners, it is important to help them understand some of the basic concepts. We need to be able to explain in a few sentences what some of the buzzwords are about. That’s the purpose of this blog.

Machine Learning

While I disagree with the statement that big data is just about machine learning, the statement shows how important machine learning is and how widely it is currently used. Arthur Samuel gave a really good definition of machine learning in 1959: a “field of study that gives computers the ability to learn without being explicitly programmed.” For years, human beings have been doing a better and better job of programming computers to do things for us. Now we are getting ready to let the machines build algorithms, study processes, and make decisions on their own. This is possible because technology has matured enough to handle large volumes of information in a timely fashion. That’s how machine learning became the jewel in the big data crown.

Machine learning overlaps with statistics. It uses mathematical optimization to build models, analyze data, and deliver predictions. Machine learning can be categorized into supervised learning and unsupervised learning. In supervised learning, a human being, the supervisor, presents examples of inputs and outputs, and through calculation the computer learns the rules that map the inputs to the outputs. In unsupervised learning, no target output is provided, and the methods are designed so that the computer discovers the structure of the data on its own.
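
Here is a minimal sketch of the two categories in plain Python, with made-up numbers. The supervised part learns one average value per label from labeled examples (a nearest-centroid rule, chosen only because it is short). The unsupervised part is a tiny one-dimensional k-means with two groups. All function names and data are assumptions for illustration.

```python
from statistics import mean

# Supervised: a person supplies (input, output) examples.
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]


def fit_centroids(examples):
    """Learn one average input value (centroid) per label."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Unsupervised: only inputs are given; the method looks for structure.
def two_means(values, iterations=10):
    """A tiny 1-D k-means with k=2, started from the min and max values."""
    a, b = min(values), max(values)
    for _ in range(iterations):
        group_a = [v for v in values if abs(v - a) <= abs(v - b)]
        group_b = [v for v in values if abs(v - a) > abs(v - b)]
        a, b = mean(group_a), mean(group_b)
    return sorted(group_a), sorted(group_b)


centroids = fit_centroids(labeled)
prediction = predict(centroids, 7.2)
clusters = two_means([1.0, 1.5, 8.0, 9.0, 1.2, 8.5])
print(prediction, clusters)
```

Note that `two_means` never sees the words "small" or "large". It only finds that the values fall into two groups; naming the groups is left to a person.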

Common problems that can be solved by machine learning are grouped into classification, clustering, regression, density estimation, and dimensionality reduction.

Deep Learning

Deep learning can be easily confused with machine learning. It is actually a branch of machine learning that learns to represent data in an abstract way. It gets its name from its use of multiple layers of non-linear processing units. The layers can be trained in a supervised or unsupervised way, and each layer uses the output of the previous layer as input. The number of layers in deep learning is closely tied to the level of abstraction of the data, since it assumes the observed data are generated by the interactions of factors organized in layers.
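
As a rough illustration of "each layer uses the output of the previous layer as input", here is a sketch of a forward pass through two layers in plain Python. The weights are hand-picked, hypothetical numbers, tanh is just one possible non-linear function, and no training is shown.

```python
import math


def layer(inputs, weights, biases):
    """One layer: a weighted sum per unit, followed by a non-linear function (tanh)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations


# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output unit.
network = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([1.0, 2.0, 3.0], network)
print(output)
```

Adding more entries to `network` makes the model deeper; in a real system the weights would be learned from data rather than typed in.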

Deep learning is in part a rebranding of older ideas from neuroscience, because it is similar to the way information is communicated and processed in a nervous system, which defines a relationship between various stimuli and the associated responses in the brain.

One of the most successful deep learning models is the artificial neural network (ANN). It has addressed many problems such as image classification, language translation, and spam identification.

Artificial Intelligence

The term artificial intelligence (AI) has the longest history and the broadest meaning. It mimics the human mind and builds cognitive functions to learn and to solve problems. AI uses machine learning and deep learning algorithms. We could say the ultimate goal of AI is to build a machine that can think, talk, and behave just like a human (such machines have been depicted vividly in countless books and movies). Today, we have already built robots that can chat with us, machines able to beat the best human chess or Go players, and cars that drive themselves.

Pattern Recognition

Pattern recognition is a machine learning method that assigns labels to input values and thereby recognizes the regularities, or patterns, in the data. Pattern recognition aims to give a reasonable explanation of all possible training data, so that pattern matching can then be applied to find a pattern for new incoming data. We can also identify anomalies that do not match the recognized patterns.
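
The anomaly part can be sketched very simply. In this example the "pattern" is just the mean and standard deviation of some training values, and anything far from the mean is flagged. The readings and the threshold of 3 standard deviations are assumptions for illustration; real pattern recognition methods are far richer than this.

```python
from statistics import mean, stdev


def learn_pattern(training_values):
    """Summarize the training data as (center, spread)."""
    return mean(training_values), stdev(training_values)


def is_anomaly(value, pattern, threshold=3.0):
    """Flag a value that lies more than `threshold` spreads from the center."""
    center, spread = pattern
    return abs(value - center) > threshold * spread


# Hypothetical sensor readings used as training data.
pattern = learn_pattern([20.1, 19.8, 20.3, 20.0, 19.9, 20.2])

flags = [is_anomaly(v, pattern) for v in (20.1, 35.0)]
print(flags)
```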

Feature Engineering

Feature engineering is also a machine learning method, used to find the characteristics of the data. We can define a lot of attributes for the data, and the ones that can be used for prediction of any sort are features. Feature engineering is an important part of predictive modeling, and the definition of the features heavily impacts the results of prediction. The feature engineering process involves brainstorming, building, repeated validation, and improvement of the features, and it usually involves both data analysts and business users.
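
A small sketch of what building features can look like: raw order records are turned into per-customer attributes that a model could use. The records, the field names, and the choice of features are all made up for illustration.

```python
from datetime import date

# Hypothetical raw records.
orders = [
    {"customer": "a", "amount": 120.0, "ordered_on": date(2016, 4, 2)},
    {"customer": "a", "amount": 80.0, "ordered_on": date(2016, 4, 9)},
    {"customer": "b", "amount": 15.0, "ordered_on": date(2016, 4, 6)},
]


def build_features(records):
    """Aggregate raw orders into one feature dictionary per customer."""
    features = {}
    for record in records:
        f = features.setdefault(
            record["customer"],
            {"order_count": 0, "total_amount": 0.0, "weekend_orders": 0},
        )
        f["order_count"] += 1
        f["total_amount"] += record["amount"]
        # weekday() returns 5 for Saturday and 6 for Sunday.
        if record["ordered_on"].weekday() >= 5:
            f["weekend_orders"] += 1
    for f in features.values():
        f["average_amount"] = f["total_amount"] / f["order_count"]
    return features


features = build_features(orders)
print(features)
```

Whether "weekend_orders" is actually useful for prediction is exactly the kind of question that the validation step, together with business users, has to answer.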
If you have read to this point, you are really interested in advanced analytics. Please stay tuned as I will explain machine learning tools including Spark MLlib in detail in my future blogs.

Introduction to ElasticSearch in Scala

I haven’t had time recently to write blogs of my own, but I saw this blog and I think it’s really helpful for those who are interested in doing ElasticSearch, so I’m going to share it on my own blog. This is the first time I’m doing a reblog.

Knoldus Blogs

Elasticsearch is a real-time distributed search and analytics engine built on top of Apache Lucene. It is used for full-text search, structured search and analytics.

Lucene is just a library and to leverage its power you need to use Java. Integrating Lucene directly with your application is a very complex task.

Elasticsearch uses the indexing and searching capabilities of Lucene but hides the complexities behind a simple RESTful API.

In this post we will learn to perform basic CRUD operations using Elasticsearch transport client in Scala with sbt as our build-tool.

Let us start by downloading Elasticsearch from here and unzipping it.

Execute the following command to run Elasticsearch in foreground:

Test it out by opening another terminal window and running the following:

To start with the coding part, create a new sbt project and add the following dependency in the build.sbt file.

Next, we need to create a…


Install Hadoop and Spark on a Mac

Hadoop performs best on a cluster of multiple nodes/servers. However, it also runs perfectly well on a single machine, even a Mac, so we can use it for development. Spark is a popular tool for processing data in Hadoop. The purpose of this blog is to show you the steps to install Hadoop and Spark on a Mac.

Operating System: Mac OS X El Capitan 10.11.3
Hadoop Version 2.7.2
Spark 1.6.1


1. Install Java

Open a terminal window to check what Java version is installed.
$ java -version

If Java is not installed, go to https://java.com/en/download/ to download and install the latest JDK. If Java is installed, use the following command in a terminal window to find the Java home path:
$ /usr/libexec/java_home

Next, we need to set the JAVA_HOME environment variable on the Mac:
$ echo "export JAVA_HOME=$(/usr/libexec/java_home)" >> ~/.bash_profile
$ source ~/.bash_profile

2. Enable SSH as Hadoop requires it.

Go to System Preferences -> Sharing -> and check “Remote Login”.

Generate SSH keys:
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Open a terminal window and make sure the following command works:
$ ssh localhost

Download Hadoop Distribution

Download the latest hadoop distribution (2.7.2 at the time of writing)

Create Hadoop Folder

Open a new terminal window, go to the download folder (let’s use “~/Downloads”), and find hadoop-2.7.2.tar

$ cd ~/Downloads
$ tar xzvf hadoop-2.7.2.tar
$ mv hadoop-2.7.2 /usr/local/hadoop

Hadoop Configuration Files

Go to the directory where your hadoop distribution is installed.
$ cd /usr/local/hadoop

Then change the following files

$ vi etc/hadoop/hdfs-site.xml


$ vi etc/hadoop/core-site.xml


$ vi etc/hadoop/yarn-site.xml


$ vi etc/hadoop/mapred-site.xml


Start Hadoop Services

Format HDFS
$ cd /usr/local/hadoop
$ bin/hdfs namenode -format

Start HDFS
$ sbin/start-dfs.sh

Start YARN
$ sbin/start-yarn.sh


Check HDFS file Directory
$ bin/hdfs dfs -ls /

If you don’t want to type bin/ every time you run a hadoop command, you can do the following

$ vi ~/.bash_profile
append this line to the end of the file: export PATH=$PATH:/usr/local/hadoop/bin
$ source ~/.bash_profile

Now add the following two folders in HDFS, which are needed for MapReduce jobs, but this time don’t include the bin/ prefix.

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/{your username}

You can also open a browser and access Hadoop by using the following URL

Next: Spark

Installing Spark is a little easier. You can download the latest Spark here:

Choosing the package type is a little tricky. We want the “pre-build with user provided Hadoop [can use with most Hadoop distributions]” type, and the downloaded file name is spark-1.6.1-bin-without-hadoop.tgz

After Spark is downloaded, we need to untar it. Open a terminal window and do the following:

$ cd ~/Downloads
$ tar xzvf spark-1.6.1-bin-without-hadoop.tgz
$ mv spark-1.6.1-bin-without-hadoop /usr/local/spark

Add spark bin folder to PATH

$ vi ~/.bash_profile
append this line to the end of the file: export PATH=$PATH:/usr/local/spark/bin
$ source ~/.bash_profile

What about Scala?

Spark is written in Scala, so even though we can use Java to write Spark code, we want to install Scala as well.

Download Scala from here: http://www.scala-lang.org/download/
Choose the first one to download Scala in binary, and the downloaded file is scala-2.11.8.tar

Untar Scala and move it to a dedicated folder

$ cd ~/Downloads
$ tar xzvf scala-2.11.8.tar
$ mv scala-2.11.8 /usr/local/scala

Add Scala bin folder to PATH

$ vi ~/.bash_profile
append this line to the end of the file: export PATH=$PATH:/usr/local/scala/bin
$ source ~/.bash_profile

Now you should be able to do the following to access Spark shell for Scala

$ spark-shell

That’s it! Happy coding!

Popular File Formats for Hadoop/HDFS

Hadoop is an ecosystem that includes many tools to store and process big data. HDFS is the one used for storage, and it’s a special file system, different from the one used on our desktop machines. We are not going to explain why it’s special. Instead, we will introduce several file formats supported by HDFS.

Text/CSV Files

The CSV file is the most commonly used data file format. It’s highly readable and easy to parse, and it’s the usual choice when exporting data from an RDBMS table. However, being human readable does not mean it is well suited to machine processing. It has three major drawbacks when used for HDFS. First, every line in a CSV file is treated as a record, so we should not include any headers or footers. In other words, a CSV file cannot be stored in HDFS with any metadata. Second, CSV files have very limited support for schema evolution. Because the fields of each record are ordered, we cannot change their order; we can only append new fields to the end of each line. Last, CSV files do not support block compression, which many other file formats do. The whole file has to be compressed and decompressed for reading, which adds a significant read performance cost.

JSON Files

JSON is a text format that stores metadata with the data, so it fully supports schema evolution. You can easily add or remove attributes for each datum. However, because it’s a text file, it doesn’t support block compression.
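
To see the schema evolution difference between CSV and JSON, here is a small sketch using only the Python standard library. The column names and sample records are made up. The CSV reader has to be told the column order, and a new field can only be handled because it was appended at the end; each JSON record carries its own field names, so order does not matter.

```python
import csv
import io
import json

# CSV: the meaning of a value depends on its position in the line.
columns = ["id", "name", "email"]  # "email" was appended later
csv_data = "1,Ann\n2,Bob,bob@example.com\n"


def read_csv(text, column_names):
    """Parse CSV lines, padding older (shorter) rows with None."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        padded = values + [None] * (len(column_names) - len(values))
        rows.append(dict(zip(column_names, padded)))
    return rows


# JSON lines: every record names its own fields.
json_data = (
    '{"id": 1, "name": "Ann"}\n'
    '{"name": "Bob", "email": "bob@example.com", "id": 2}\n'
)


def read_json_lines(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line]


csv_rows = read_csv(csv_data, columns)
json_rows = read_json_lines(json_data)
print(csv_rows)
print(json_rows)
```

If the new field had been inserted in the middle of the CSV line instead of at the end, the older rows would be silently misread, which is the limitation described above.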

Avro Files

An Avro file is serialized data in a binary format. It uses JSON to define data types and stores the data row by row. It is one of the most popular storage formats for Hadoop. Avro stores metadata with the data, and it also allows you to specify an independent schema for reading the files. Therefore, you can easily add, delete, or update data fields just by creating a new independent schema. Also, Avro files are splittable, support block compression, and enjoy a wide range of tool support within the Hadoop ecosystem.
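
The idea of reading with an independent schema can be sketched in plain Python. This does not use the Avro library or its file format; the `resolve` function and the dictionary-based "schema" are simplifications I made up to show the principle: fields the reader asks for but the writer never wrote get a default value, and fields the reader does not ask for are ignored.

```python
# Records as they were written with an older schema.
written = [{"id": 1, "name": "Ann", "legacy_code": "x"}]

# Reader schema: field name -> default used when the written record lacks it.
reader_schema = {"id": None, "name": None, "email": "unknown"}


def resolve(record, schema):
    """Project a written record onto the reader's schema."""
    return {field: record.get(field, default) for field, default in schema.items()}


resolved = [resolve(record, reader_schema) for record in written]
print(resolved)
```

The written data is never modified; only the view of it changes, which is why evolving the schema is cheap.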

Sequence Files

Sequence files are binary files with a CSV-like structure. They do not store metadata, nor do they support schema evolution, but they do support block compression. Because they are not human readable, they are mostly used for intermediate data storage within a sequence of MapReduce jobs.

ORC Files

RC files, or Record Columnar files, are a columnar file format. They are great for compression and best for query performance, at the cost of more memory and poor write performance. ORC files are optimized RC files that work better with Hive. They compress better, but still do not support schema evolution. It is worth noting that ORC is a format primarily backed by Hortonworks, and it’s not supported by Cloudera Impala.

Parquet Files

The Parquet file format is also a columnar format. Just like ORC, it’s great for compression and offers great query performance. It’s especially efficient when querying data from specific columns. Parquet is computationally intensive on the write side, but it cuts a lot of I/O cost, which makes for great read performance. It enjoys more freedom than ORC in schema evolution, in that new columns can be added to the end of the structure. It is also backed by Cloudera and optimized for Impala.

Since Avro and Parquet have so much in common, let’s look a little more closely at both. When choosing a file format to use with HDFS, we need to consider read performance and write performance. Because HDFS is designed to store data that is written once and read many times, we want to emphasize read performance. The fundamental difference in how to use the two formats is this: Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro. Parquet is a column-based format. If your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet.
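
The row versus column distinction can be illustrated with ordinary Python data structures. This is only an in-memory analogy with made-up records, not the actual Avro or Parquet layout on disk.

```python
# Row-based layout: each record is stored together.
rows = [
    {"id": 1, "city": "San Diego", "amount": 10.0},
    {"id": 2, "city": "Irvine", "amount": 25.0},
    {"id": 3, "city": "San Diego", "amount": 5.0},
]


def to_columns(records):
    """Regroup the same data so that each column is stored together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: retrieving a whole record is a single lookup.
second_record = rows[1]

# Column layout: an aggregate over one column touches only that column.
total_amount = sum(columns["amount"])

print(second_record, total_amount)
```

Computing `total_amount` from `rows` would mean walking through every record and skipping over `id` and `city` each time, which is the I/O that a columnar format avoids.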

Statistics vs. Machine Learning

I have been asked the question “What’s the difference between statistics and machine learning?” quite a lot lately, almost as often as “What’s the difference between a data analyst and a data scientist?” (which I might write about in another blog). People who wonder about the difference between these subjects probably see a lot of similarity between the two: both are means of learning about data, and they share many of the same methods.

The fundamental difference between the two is this: statistics focuses on inference and conclusions, while machine learning emphasizes predictions and decisions.

Statisticians care deeply about the data collection process, methodology, and statistical properties of the estimator. They are interested in learning something about the data. Statistics may support or reject a hypothesis based on the noise in the data, validate models, or make forecasts, but overall the goal is to arrive at a new scientific insight based on the data. In other words, it wants to draw a valid and precise conclusion about the problem posed.

Machine learning is about making predictions, and the algorithm is just a means to that end. The goal is to solve a complex computational task by feeding data to a machine so that it tells us what the outcome will be. Instead of figuring out the cause and effect, we collect a large number of examples of what the mechanism should produce, and then run an algorithm that learns to perform the task from the examples. It builds a model to predict a result and uses data to improve its predictions.

You may have realized that quite a few algorithms used in machine learning are statistical in nature, but as long as the prediction works well, any kind of statistical insight into the data is not necessary.

A paper Statistical Modeling: The Two Cultures published by Leo Breiman in the year 2001 explains the differences between statistics and machine learning very well. I’m going to post the abstract here:


“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”

In this paper, two cultures are introduced, and we can treat the Data Modeling Culture as statistics and the Algorithmic Modeling Culture as machine learning. (The term machine learning still resided mostly in science fiction in 2001.) The following two pictures clearly show the difference between the two cultures.

The data modeling culture assumes a data model and estimates the values of the model’s parameters from the data.

[Figure: the data modeling culture]

The algorithmic modeling culture treats the true mechanism inside the box as complex and unknown. It creates another algorithm that operates on x to predict y.

[Figure: the algorithmic modeling culture]
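
The two cultures can be contrasted on a toy data set. In this sketch, with made-up numbers, the data modeling approach assumes y = a + b*x and estimates a and b by least squares, while the algorithmic approach assumes no form at all and predicts from the nearest examples. The choice of a straight line and of a nearest-neighbor rule is mine, picked only to keep the example short; it is not taken from Breiman’s paper.

```python
from statistics import mean

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x_values, y_values):
    """Data modeling: assume y = a + b*x and estimate (a, b) by least squares."""
    x_bar, y_bar = mean(x_values), mean(y_values)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values)) / sum(
        (x - x_bar) ** 2 for x in x_values
    )
    return y_bar - b * x_bar, b


def knn_predict(x_values, y_values, x_new, k=2):
    """Algorithmic modeling: average the outputs of the k nearest examples."""
    nearest = sorted(zip(x_values, y_values), key=lambda pair: abs(pair[0] - x_new))[:k]
    return mean(y for _, y in nearest)


intercept, slope = fit_line(xs, ys)
line_prediction = intercept + slope * 3.5
knn_prediction = knn_predict(xs, ys, 3.5)
print(intercept, slope, line_prediction, knn_prediction)
```

The fitted slope is something you can interpret and draw conclusions from. The nearest-neighbor rule gives a prediction but no parameters to interpret, which is the trade-off described above.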

Now I wonder if there will be anyone who can’t wait for my blog about the other frequently asked question, and pop this one: “What about Data Science?”

Data science employs techniques and theories drawn from many fields, including mathematics, statistics, information science, and computer science (which in turn includes machine learning, data mining, predictive analytics, etc.), to extract knowledge or insights from data. A data scientist is not just a fancy new title on a name card; a data scientist is a true master of data.

Big Data Everywhere

If your work is related to Big Data but you have not heard of the Big Data Everywhere conference, don’t panic; chances are that you are just not using MapR Hadoop. This is an event sponsored by MapR and its many partners. However, the topics cover all areas of Big Data, and you won’t feel left out if you have only been using Cloudera or Hortonworks.

The conference is held in many cities several times a year, and the one I attended was in San Diego on April 12, 2016. Traffic was really bad on Interstate 5 from Orange County to San Diego that morning, and I spent 2 hours on the road and arrived 45 minutes late. The breakfast provided was really good, so I decided to spend the 15 minutes eating instead of socializing with a room full of talented data professionals.

A full agenda is shown in the following picture, and I will summarize all the talks in this event based on my own written notes since the organizer still has not sent out the official presentation decks.

[Update 4/14/2016] The full presentation deck for the talks is now available.

[Figure: conference agenda]

The first speaker is Jim Scott, director of enterprise strategy and architecture at MapR. His topic is Streaming in the Extreme. First he explains what enterprise architecture is, using a circular diagram he drew himself that covers all areas of a company’s data strategy, with an emphasis on the fact that solution architecture is not the same as enterprise architecture. Later he introduces a streaming process he implemented using MapR streaming, which, according to the statistics provided, beats Apache Kafka. When asked whether he considers MapR streaming the best among all similar technologies, including Flink, Spark, Apex, Storm, etc., Jim gives the opinion that MapR streaming is definitely the best when used with MapR Hadoop.


Next on the stage is Alex Garbarini, information technology engineer from Cisco, and his topic is Build and Operationalize Enterprise Data Lake in Big Enterprise. Being a technology company, Cisco was able to implement a data lake themselves using Hadoop that handles 2 billion records on a daily basis. The data lake is now a hub for multiple business usage including the analysis of Webex user activities.

Right after the talk about the data lake comes a topic titled Going Beyond Data Lake. Vik Kapoor, director of analytics technology architecture and platforms at Pfizer, talked about how they leverage the entire analytics ecosystem. They formed their practices around 4 steps: find, explore, understand, and share, which go through the data load, data wrangling, data discovery, and evaluation processes and produce data products as a result. He also introduced the tools they use for each step.

Coming up next is a panel discussion. Scott Saufferer and Robert Warner from ID Analytics answered interview questions from a host. The director of data operations and director of engineers took turns to tell the audience how they introduced Hadoop into their company, and how both teams collaborate to make the best of it.

Next on the stage is Alex Bates, a soft-spoken CTO from Mtell, talking about hardware and IoT. Mtell manufactures smart machines with sensors built inside to transmit data about the status of the machines. Data is collected and processed by apps written in Spark. With the help of machine learning, they learned a lot about the machines and created different agents to monitor anomalies and prevent failures. Also, RESTful APIs are provided so that different clients can integrate this with their monitoring tools.

When data architects and data scientists are fighting for the driver’s seat of the big data group within any organization, it’s only fair to invite speakers from both sides to any big data conference. Allen Day, chief scientist at MapR and contributor to many open source projects and machine learning algorithm implementations, did an awesome job explaining how to build a Genome Analysis Pipeline in simple words and diagrams that people with little knowledge of data science can understand. For those who wanted to dig deeper, he also provided the git link to the source code: https://github.com/allenday/spark-genome-alignment-demo.

Last but not least, the energetic Stefan Groschupf, CEO of Datameer, jumped on the stage and gave a speech about how to jumpstart a big data project for any organization. As a seasoned entrepreneur, he has a lot of experience in running an organization, and his advice is simple and straightforward. Instead of spending a whole lot of money on the latest technology, he suggested forming a small team within the company. The members should be cross functional, with different types of employees including the visionary, the reality check, the challengers, and the worker bees. The team focuses on problems within the company before bringing up innovative ideas or other people’s use cases. As the big data project goes on, the team finds a pain point, identifies a few possible solutions, approaches it from one small angle, and tries out different tools to tackle it as a proof of concept. A process with a successful result is then scaled up into a full solution that can bring even more value to the company. The core team members also become the implementers and managers of the new process.

Believe it or not, this is exactly the kind of idea I approach my clients with as a big data consultant, and I’ve seen them become more and more confident and successful with what they are doing within a couple of years. “Great minds think alike!” That is a great feeling to go home with after a long half-day event.