Hadoop

Advanced Analytics Reference Architecture

 

Building data platforms and delivering advanced analytical services in the new age of data intelligence can be a daunting task. The sheer number of tools and methodologies available does not make it any easier. Therefore, a reference architecture is needed to provide guidelines for process design and best practices for advanced analytics, so that we not only meet the business requirements but also bring more value to the business.

1. Architectural Guidance

  • The architecture should cover all building blocks including the following: Data Infrastructure, Data Engineering, Traditional Business Intelligence, and Advanced Analytics. Within Advanced Analytics, we should include machine learning, deep learning, data science, predictive analytics, and the operationalization of models.
  • One of the first steps should be identifying the gaps between the current infrastructure, tools, and technologies and the end-state environment.
  • We need to create a unified approach to both structured and unstructured data. It’s perfectly fine to maintain two different environments for structured and unstructured data, although the two systems will look more and more alike over time.
  • Rome was not built in a day. We need to first build a road map, with budget in mind, for how the organization can get to the end state, and adapt and/or pivot whenever needed along the way.

2. Best Practices

  • There is never one best solution for everything; each scenario has its own best approach. However, we can create standard approaches for different categories. Creating best practices for different categories or industries and offering them as options is itself a best practice.
  • Things to consider when suggesting a best practice include company size, current infrastructure, and the skill sets of existing IT personnel.

3. Framework for Solutions

  • A reference architecture for Advanced Analytics is depicted in the following diagram. At the bottom of the picture are the data sources, divided into structured and unstructured categories. Structured data is mostly operational data from existing ERP, CRM, Accounting, and other systems that create the transactions of the business. It is handled by relational database management systems (RDBMS) such as Oracle, Teradata, and Microsoft SQL Server. The RDBMS can serve as the backend for the applications that produce these transactions; these are called OLTP (online transaction processing) systems. Periodically, the transactional data is copied over to data stores for analytical and reporting purposes. These data stores are also built on RDBMS, and they are called OLAP (online analytical processing) systems. On top of the data warehouse sit business intelligence and data visualization, and we have quite a few powerful tools to support this capability.
  • On the right side, where unstructured data is processed, is the big data world. Just as with structured data, there is a variety of tools we can use for ETL (Extract-Transform-Load) of the data into the selected data platforms, which include Hadoop, NoSQL, and various cloud-based storage systems. Data is ingested into these filesystem-based data stores and then processed by multiple analytical tools. The analytical results are either fed to data visualization tools or operationalized through APIs built with a wide range of technologies.
  • Demand for stream processing is also growing tremendously, requiring real-time or near real-time analytics of vast amounts of data to identify trends, find anomalies, and predict results. A few tools that can be used in this category are recommended; a minimal sketch of such a streaming job follows the diagram below.

[Diagram: Advanced Analytics reference architecture]
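To make the streaming bullet above more concrete, here is a minimal Spark Streaming sketch in Scala of a near real-time word count, one possible instance of the streaming tools shown in the diagram. It assumes a Spark 1.6 installation such as the one described later on this page and a hypothetical plain-text source on localhost port 9999 (for example one created with nc -lk 9999); it is an illustration, not a prescribed implementation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads: one to receive data, one to process it
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Hypothetical source: a plain-text socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print() // in a real pipeline this would feed a dashboard or an API
    ssc.start()
    ssc.awaitTermination()
  }
}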

4. The tools I recommend for various data processing purposes are listed as follows (so they can be search-engine friendly, even though they are all listed in the picture above).

  • Data Ingestion (Paxata, Pentaho, Talend, informatica…)
  • Data Storage (Cloudera, Hortonworks, MapR Hadoop, Cassandra, HBase, MongoDB, S3, Google CloudPlatform…)
  • Data Analytics (Python, R, H2O (ML and DL), TensorFlow (GPU optional), Databricks/Spark, Caffe (GPU) and Torch (GPU) for deep learning on images and sound)
  • Data Visualization (Tableau, Qlikview, Cisco DV…)

Install Hadoop and Spark on a Mac

Hadoop performs best on a cluster of multiple nodes/servers; however, it can run perfectly well on a single machine, even a Mac, so we can use it for development. Spark is also a popular tool for processing data in Hadoop. The purpose of this blog is to show you the steps to install Hadoop and Spark on a Mac.

Operating System: Mac OS X El Capitan 10.11.3
Hadoop Version 2.7.2
Spark 1.6.1

Pre-requisites

1. Install Java

Open a terminal window to check what Java version is installed.
$ java -version

If Java is not installed, go to https://java.com/en/download/ to download and install the latest JDK. If Java is installed, use the following command in a terminal window to find the Java home path:
$ /usr/libexec/java_home

Next, we need to set the JAVA_HOME environment variable on the Mac:
$ echo 'export JAVA_HOME=$(/usr/libexec/java_home)' >> ~/.bash_profile
$ source ~/.bash_profile

2. Enable SSH as Hadoop requires it.

Go to System Preferences -> Sharing -> and check “Remote Login”.

Generate SSH Keys
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Open a terminal window, and make sure you can SSH to localhost.
$ ssh localhost

Download Hadoop Distribution

Download the latest hadoop distribution (2.7.2 at the time of writing)
http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz

Create Hadoop Folder

Open a new terminal window, go to the download folder (let’s use “~/Downloads”), and find hadoop-2.7.2.tar.gz

$ cd ~/Downloads
$ tar xzvf hadoop-2.7.2.tar.gz
$ mv hadoop-2.7.2 /usr/local/hadoop

Hadoop Configuration Files

Go to the directory where your hadoop distribution is installed.
$ cd /usr/local/hadoop

Then change the following files

$ vi etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

$ vi etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

$ vi etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

$ vi etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Start Hadoop Services

Format HDFS
$ cd /usr/local/hadoop
$ bin/hdfs namenode -format

Start HDFS
$ sbin/start-dfs.sh

Start YARN
$ sbin/start-yarn.sh

Validation

Check HDFS file Directory
$ bin/hdfs dfs -ls /

If you don’t want to include bin/ every time you run a Hadoop command, you can do the following:

$ vi ~/.bash_profile
Append this line to the end of the file: export PATH=$PATH:/usr/local/hadoop/bin
$ source ~/.bash_profile

Now add the following two folders in HDFS, which are needed for MapReduce jobs, but this time, don’t include the bin/.

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/{your username}

You can also open a browser and access the Hadoop NameNode web UI using the following URL:
http://localhost:50070/

Next: Spark

Installing Spark is a little easier. You can download the latest Spark here:
http://spark.apache.org/downloads.html

It’s a little tricky to choose the package type. We want the “pre-built with user-provided Hadoop [can use with most Hadoop distributions]” type, and the downloaded file name is spark-1.6.1-bin-without-hadoop.tgz

After Spark is downloaded, we need to untar it. Open a terminal window and do the following:

$ cd ~/Downloads
$ tar xzvf spark-1.6.1-bin-without-hadoop.tgz
$ mv spark-1.6.1-bin-without-hadoop /usr/local/spark

Add spark bin folder to PATH

$ vi ~/.bash_profile
Append this line to the end of the file: export PATH=$PATH:/usr/local/spark/bin
$ source ~/.bash_profile

What about Scala?

Spark is written in Scala, so even though we can use Java to write Spark code, we want to install Scala as well.

Download Scala from here: http://www.scala-lang.org/download/
Choose the first option to download the Scala binaries; the downloaded file is scala-2.11.8.tgz

Untar Scala and move it to a dedicated folder

$ cd ~/Downloads
$ tar xzvf scala-2.11.8.tgz
$ mv scala-2.11.8 /usr/local/scala

Add Scala bin folder to PATH

$ vi ~/.bash_profile
Append this line to the end of the file: export PATH=$PATH:/usr/local/scala/bin
$ source ~/.bash_profile

Now you should be able to do the following to access Spark shell for Scala

$ spark-shell
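As a quick sanity check, here is a small Scala snippet you can paste into the spark-shell prompt. The HDFS path is hypothetical: it assumes the single-node HDFS configured earlier is running on localhost:9000 and that you have uploaded a file named sample.txt, so adjust it to a file you actually have.

// A tiny job that runs entirely in memory
val nums = sc.parallelize(1 to 1000)
println(nums.sum()) // should print 500500.0

// Read a text file from the local HDFS set up earlier (hypothetical path)
val lines = sc.textFile("hdfs://localhost:9000/user/{your username}/sample.txt")
println(lines.count())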

That’s it! Happy coding!

Popular File Formats for Hadoop/HDFS

Hadoop is an ecosystem that includes many tools to store and process big data. HDFS is the one used for storage, and it’s a special file system, different from the one used on our desktop machines. We are not going to explain why it’s special; instead, we will introduce several file formats commonly used on HDFS.

Text/CSV Files

The CSV file is the most commonly used data file format. It’s the most readable and also ubiquitously easy to parse. It’s the format of choice when exporting data from an RDBMS table. However, human readable does not mean machine friendly. It has three major drawbacks when used for HDFS. First, every line in a CSV file is a record; therefore, we should not include any headers or footers. In other words, a CSV file cannot be stored in HDFS with any metadata. Second, CSV files have very limited support for schema evolution. Because the fields of each record are ordered, we are not able to change that order; we can only append new fields to the end of each line. Last, CSV files do not support block compression, which many other file formats support. The whole file has to be compressed and decompressed for reading, adding a significant read performance cost.

JSON Files

JSON is a text format that stores metadata with the data, so it fully supports schema evolution. You can easily add or remove attributes for each datum. However, because it is a text format, it doesn’t support block compression.
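As a small illustration of the metadata traveling with the data, the Scala snippet below (to be pasted into the spark-shell installed in the previous post) reads a hypothetical line-delimited JSON file and lets Spark 1.6 infer the schema from the attributes present in each record; the file name and path are assumptions for the example.

// people.json is hypothetical; each line is a complete JSON object, e.g.
// {"name": "Ada", "age": 36}
// {"name": "Grace", "age": 45, "title": "RADM"} <- an extra attribute is fine
val people = sqlContext.read.json("hdfs://localhost:9000/user/{your username}/people.json")
people.printSchema() // the schema is inferred from the data itself
people.show()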

Avro Files

An Avro file is serialized data in binary format, with the schema defined in JSON, and it is a row-based format. It is one of the most popular storage formats for Hadoop. Avro stores metadata with the data, and it also allows specification of an independent schema for reading the file. Therefore, you can easily add, delete, or update data fields just by creating a new independent schema. Avro files are also splittable, support block compression, and enjoy a wide range of tool support within the Hadoop ecosystem.

Sequence Files

Sequence files are binary files with a CSV-like structure. They do not store metadata, nor do they support schema evolution, but they do support block compression. Since they are not human readable, they are mostly used for intermediate data storage within a chain of MapReduce jobs.

ORC Files

RC files, or Record Columnar files, are a columnar file format. They compress well and deliver the best query performance, at the cost of more memory and poor write performance. ORC files are optimized RC files that work better with Hive. They compress better, but still do not support schema evolution. It is worth noting that ORC is a format primarily backed by Hortonworks, and it is not supported by Cloudera Impala.

Parquet Files

The Parquet file format is also a columnar format. Just like ORC, it compresses well with great query performance, and it is especially efficient when querying data from specific columns. The Parquet format is computationally intensive on the write side, but it cuts a lot of I/O cost to deliver great read performance. It enjoys more freedom than ORC in schema evolution, in that new columns can be added to the end of the structure. It is also backed by Cloudera and optimized for Impala.

Since Avro and Parquet have so much in common, let’s review both a little more. When choosing a file format to use with HDFS, we need to consider read performance and write performance. Because the nature of HDFS is to store data that is written once and read multiple times, we want to emphasize read performance. The fundamental difference in terms of usage is this: Avro is a row-based format; if you want to retrieve records as a whole, use Avro. Parquet is a column-based format; if your data consists of many columns but you are only interested in a subset of them, use Parquet. A short Spark sketch of writing and reading both formats follows.
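As a rough illustration of how the two formats are used from Spark 1.6’s spark-shell, the sketch below writes a small DataFrame as Parquet (built into Spark) and as Avro. The Avro part assumes the third-party spark-avro package has been added to the shell, for example by starting it with --packages com.databricks:spark-avro_2.10:2.0.1; the data and HDFS paths are hypothetical.

// Paste into spark-shell; the paths under /tmp/formats are made up for the example
import sqlContext.implicits._

val df = sc.parallelize(Seq(("alice", 34), ("bob", 29))).toDF("name", "age")

// Parquet support is built into Spark
df.write.parquet("hdfs://localhost:9000/tmp/formats/people_parquet")
val fromParquet = sqlContext.read.parquet("hdfs://localhost:9000/tmp/formats/people_parquet")
fromParquet.select("age").show() // column pruning: only the age column is read

// Avro needs the external spark-avro package on the classpath
df.write.format("com.databricks.spark.avro").save("hdfs://localhost:9000/tmp/formats/people_avro")
val fromAvro = sqlContext.read.format("com.databricks.spark.avro").load("hdfs://localhost:9000/tmp/formats/people_avro")
fromAvro.show() // row-oriented: whole records come back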

Big Data Everywhere

If your work is related to Big Data but you have not heard of the Big Data Everywhere Conference, don’t panic; chances are that you are just not using MapR Hadoop. This is an event sponsored by MapR and its many partners. However, the topics cover all areas of Big Data, and you won’t feel left out if you have only been using Cloudera or Hortonworks.

The conference is held in many cities several times a year, and the one I attended was in San Diego on April 12, 2016. Traffic was really bad on Interstate 5 from Orange County to San Diego that morning; I was on the road for 2 hours and arrived 45 minutes late. The breakfast provided was really good, so I decided to spend the 15 minutes eating instead of socializing with a full room of talented data professionals.

The full agenda is shown in the following picture, and I will summarize all the talks at this event based on my own written notes, since the organizer has not yet sent out the official presentation decks.

[Update 4/14/2016] The full presentation decks for the talks are available now.

[Photo: conference agenda]

The first speaker was Jim Scott, director of enterprise strategy and architecture at MapR. His topic was Streaming in the Extreme. First he explained what enterprise architecture is, with a circular diagram he drew himself covering every area of a company’s data strategy, emphasizing that solution architecture is not the same as enterprise architecture. Later he introduced a streaming process he implemented using MapR streaming which, according to the statistics provided, beats Apache Kafka. When asked whether he considers MapR streaming the best among all similar technologies, including Flink, Spark, Apex, Storm, etc., Jim gave the opinion that MapR streaming is definitely the best when used with MapR Hadoop.


Next on the stage was Alex Garbarini, an information technology engineer from Cisco, whose topic was Build and Operationalize an Enterprise Data Lake in a Big Enterprise. Being a technology company, Cisco was able to implement a data lake itself using Hadoop, handling 2 billion records on a daily basis. The data lake is now a hub for multiple business uses, including the analysis of Webex user activities.

Right after a talk about data lakes came a topic titled Going Beyond Data Lake. Vik Kapoor, director of analytics technology architecture and platforms at Pfizer, talked about how they leverage the entire analytics ecosystem. They built their practice around four steps: find, explore, understand, and share, which go through the data load, data wrangling, data discovery, and evaluation processes and produce data products as a result. He also introduced the tools they use for each step.

Up next was a panel discussion. Scott Saufferer and Robert Warner from ID Analytics answered interview questions from a host. The director of data operations and the director of engineering took turns telling the audience how they introduced Hadoop into their company and how both teams collaborate to make the best of it.

Next on the stage was Alex Bates, a soft-spoken CTO from Mtell, talking about hardware: IoT. Mtell manufactures smart machines with built-in sensors that transmit data about the machines’ status. The data is collected and processed by apps written in Spark. With the help of machine learning, they learned a lot about the machines and created different agents to monitor anomalies and prevent failures. RESTful APIs were also created so that different clients can integrate this with their monitoring tools.

While data architects and data scientists fight over the driver’s seat of big data groups within organizations, it’s only fair to invite speakers from both sides to any big data conference. Allen Day, chief scientist at MapR and a contributor to many open source projects and machine learning algorithm implementations, did an awesome job explaining how to build a Genome Analysis Pipeline in simple words and diagrams that people with little knowledge of data science can understand. For those who want to dig deeper, he also provided the link to the source code: https://github.com/allenday/spark-genome-alignment-demo.

Last but not least, the energetic Stefan Groschupf, CEO of Datameer, jumped on the stage and gave a speech about how to jumpstart a big data project in any organization. As a seasoned entrepreneur, he has a lot of experience running an organization, and his advice was simple and straightforward. Instead of spending a whole lot of money on the latest technology, he suggested forming a small team within the company. The members should be cross-functional, including different types of employees: the visionary, the reality check, the challengers, and the worker bees. The team should focus on problems within the company before bringing up innovative ideas or other people’s use cases. As the big data project goes on, the team finds a pain point, identifies a few possible solutions, starts from one small angle, and tries out different tools to tackle it as a proof of concept. A process with a successful result is then scaled up into a full solution that can bring even more value to the company, and the core team members become the implementers and managers of the new process.

Believe it or not, this is exactly the kind of idea I approach my clients with as a big data consultant, and I’ve seen them become more and more confident and successful with what they are doing within a couple of years. “Great minds think alike!” That is a great feeling to go home with after a long half-day event.
