Popular File Formats for Hadoop/HDFS

Hadoop is an ecosystem of many tools for storing and processing big data. HDFS is the component used for storage, and it is a special file system, quite different from the ones on our desktop machines. Rather than explain what makes it special, this post introduces several file formats commonly used on HDFS.

Text/CSV Files

CSV is the most commonly used data file format. It is the most human-readable format, it is easy to parse, and it is the format of choice when exporting data from an RDBMS table. However, human readable does not mean machine friendly, and CSV has three major drawbacks when used with HDFS. First, every line in a CSV file is a record, so we should not include headers or footers; in other words, a CSV file cannot be stored in HDFS with any metadata. Second, CSV has very limited support for schema evolution. Because the fields of each record are ordered, we cannot change that order; we can only append new fields to the end of each line. Last, CSV does not support block compression, which many other file formats do. The whole file has to be compressed and decompressed for reading, adding a significant read performance cost.
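A small Python illustration of the positional nature of CSV, using the standard csv module on a made-up two-record export:

```python
import csv
import io

# A CSV export carries no header or other metadata: every line is a record,
# and a field's meaning is purely positional.
data = "1,Alice,2016-04-01\n2,Bob,2016-04-02\n"
records = list(csv.reader(io.StringIO(data)))

# Fields can only be addressed by position, so the column order can never
# change; a new field could only be appended at the end of each line.
first_id, first_name, first_date = records[0]
```
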

JSON Files

JSON is a text format that stores metadata with the data, so it fully supports schema evolution: you can easily add or remove attributes for each datum. However, because it is a text format, it does not support block compression.
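A quick Python illustration with two invented records, showing how a reader copes when an attribute is added to later records:

```python
import json

# Each JSON record carries its own field names, so records with different
# shapes can live side by side: attributes can be added or dropped freely.
old = json.loads('{"id": 1, "name": "Alice"}')
new = json.loads('{"id": 2, "name": "Bob", "email": "bob@example.com"}')

# A reader tolerates the missing attribute by supplying a default.
emails = [rec.get("email", "unknown") for rec in (old, new)]
```
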

Avro Files

An Avro file is serialized data in a binary format. It uses JSON to define the schema, and the data itself is stored row by row. It is one of the most popular storage formats for Hadoop. Avro stores metadata with the data, and it also allows you to specify an independent schema for reading the files. Therefore, you can easily add, delete, or update data fields just by creating a new reader schema. Avro files are also splittable, support block compression, and enjoy a wide range of tool support within the Hadoop ecosystem.
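A sketch of what such a schema pair might look like (the User record and its fields are hypothetical; actual serialization would be done with an Avro library such as fastavro, which is not shown here):

```python
import json

# An Avro schema is plain JSON. Giving a new field a "default" lets a newer
# reader schema still decode records written with the older writer schema.
writer_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"}]}
""")

reader_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": "string", "default": ""}]}
""")

# Fields present only in the reader schema are filled from their defaults.
writer_fields = {f["name"] for f in writer_schema["fields"]}
added = [f["name"] for f in reader_schema["fields"]
         if f["name"] not in writer_fields]
```
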

Sequence Files

Sequence files are binary files with a CSV-like structure. They do not store metadata and do not support schema evolution, but they do support block compression. Because they are not human readable, they are mostly used for intermediate data storage within a chain of MapReduce jobs.

ORC Files

RC (Record Columnar) files are a columnar file format. They compress well and offer excellent query performance, at the cost of higher memory use and poor write performance. ORC (Optimized Row Columnar) files are an improved RC format that works better with Hive: they compress even better, but still do not support schema evolution. It is worth noting that ORC is a format primarily backed by Hortonworks, and it is not supported by Cloudera Impala.

Parquet Files

The Parquet file format is also columnar. Just like ORC, it compresses well and delivers great query performance, and it is especially efficient when querying specific columns. Parquet is computationally intensive on the write side, but it cuts a lot of I/O cost to achieve great read performance. It allows a bit more schema evolution than ORC, in that new columns can be added at the end of the structure. It is backed by Cloudera and optimized for Impala.

Since Avro and Parquet have so much in common, let's compare them a little more. When choosing a file format for HDFS, we need to consider both read and write performance. Because HDFS is built for data that is written once and read many times, we usually emphasize read performance. The fundamental difference between the two formats is this: Avro is row based, so if you want to retrieve records as a whole, use Avro. Parquet is column based, so if your data has many columns but you are only interested in a subset of them, use Parquet.
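A rough sketch in plain Python of why the two layouts favor different access patterns (the records are invented; real Avro and Parquet files are binary and far more sophisticated):

```python
# Row layout (Avro-like): each record is stored contiguously, so reading a
# whole record is cheap. Column layout (Parquet-like): each column is stored
# contiguously, so scanning a subset of columns is cheap.
rows = [("Alice", 30, "NY"), ("Bob", 25, "SF"), ("Carol", 35, "LA")]

# The same data rearranged column by column.
columns = {"name": [r[0] for r in rows],
           "age":  [r[1] for r in rows],
           "city": [r[2] for r in rows]}

# Whole-record access favors the row layout...
second_record = rows[1]

# ...while a single-column scan touches only one contiguous list.
average_age = sum(columns["age"]) / len(columns["age"])
```
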


Statistics vs. Machine Learning

I have been asked the question "What's the difference between statistics and machine learning?" quite a lot lately, almost as often as "What's the difference between a data analyst and a data scientist?" (which I might write about in another blog). People wondering about the difference probably see a lot of similarity between the two subjects: they are both means of learning from data, and they share many of the same methods.

The fundamental difference between the two is this: statistics focuses on inference and conclusions, while machine learning emphasizes predictions and decisions.

Statisticians care deeply about the data collection process, the methodology, and the statistical properties of the estimator. They are interested in learning something from the data. Statistics may support or reject hypotheses in the presence of noisy data, validate models, or make forecasts, but overall the goal is to arrive at a new scientific insight based on the data. In other words, it wants to draw valid and precise conclusions about the problems posed.

Machine learning is about making a prediction, and the algorithm is just a means to that end. The goal is to solve a complex computational task by feeding data to a machine so it can tell us what the outcome will be. Instead of figuring out cause and effect, we collect a large number of examples of what the mechanism should do, then run an algorithm that learns to perform the task from those examples. It builds a model to predict a result, and uses data to improve its predictions.

You may have realized that quite a few algorithms used in machine learning are statistical in nature, but as long as the predictions work well, statistical insight into the data is not strictly necessary.

The paper "Statistical Modeling: The Two Cultures", published by Leo Breiman in 2001, explains the differences between statistics and machine learning very well. Here is the abstract:


There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

In this paper, two cultures are introduced: we can treat the Data Modeling Culture as statistics and the Algorithmic Modeling Culture as machine learning. (The term "machine learning" still resided mostly in science fiction in 2001.) The following two pictures show the difference between the two cultures clearly.

The data modeling culture assumes a data model and estimates the values of its parameters from the data.

[Figure: the data modeling culture]

Algorithmic modeling treats the true mechanism inside the box as complex and unknown. It builds another algorithm that operates on x to predict y.

[Figure: the algorithmic modeling culture]
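A tiny Python sketch of the contrast, on made-up data: the first half estimates the parameters of an assumed linear data model (the parameters themselves are the object of interest), while the second predicts y directly with a 1-nearest-neighbor rule and never looks at parameters at all:

```python
# Data modeling culture: assume y = a*x + b + noise and estimate a and b.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1 here, for clarity

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var = sum((x - mean_x) ** 2 for x in xs)
a = cov / var          # estimated slope
b = mean_y - a * mean_x  # estimated intercept

# Algorithmic modeling culture: skip the parameters and predict y for a new
# x directly, here with a 1-nearest-neighbor rule.
def predict_nn(x_new):
    nearest = min(range(n), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]
```
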

Now I wonder whether anyone who can't wait for my blog about the other frequently asked question will pop this one: "What about data science?"

Data science employs techniques and theories drawn from many fields, including mathematics, statistics, information science, and computer science, which also cover machine learning, data mining, predictive analytics, and so on, to extract knowledge or insights from data. Data scientist is not just a fancy new title on a business card; a data scientist is a true master of data.

Big Data Everywhere

If your work is related to Big Data but you have not heard of the Big Data Everywhere conference, don't panic; chances are you are just not using MapR Hadoop. The event is sponsored by MapR and its many partners, but the topics cover all areas of Big Data, and you won't feel left out if you have only been using Cloudera or Hortonworks.

The conference is held in many cities several times a year; the one I attended was in San Diego on April 12, 2016. Traffic was really bad on Interstate 5 from Orange County to San Diego that morning, so I spent two hours on the road and arrived 45 minutes late. The breakfast provided was really good, so I decided to spend 15 minutes eating instead of socializing with a full room of talented data professionals.

I will summarize all the talks at this event based on my own written notes, since the organizer had not yet sent out the official presentation decks.

[Update 4/14/2016] The full presentation decks for the talks are available now.


The first speaker was Jim Scott, director of enterprise strategy and architecture at MapR. His topic was Streaming in the Extreme. First he explained what enterprise architecture is, using a circular diagram he drew himself covering all areas of a company's data strategy, with an emphasis on the fact that solution architecture is not the same as enterprise architecture. Later he introduced a streaming process he implemented using MapR streaming which, according to the statistics provided, beats Apache Kafka. When asked whether he considered MapR streaming the best among all similar technologies, including Flink, Spark, Apex, and Storm, Jim offered the opinion that MapR streaming is definitely the best when used with MapR Hadoop.


Next on the stage was Alex Garbarini, information technology engineer at Cisco, whose topic was Build and Operationalize an Enterprise Data Lake in a Big Enterprise. Being a technology company, Cisco was able to implement a data lake itself using Hadoop, handling 2 billion records on a daily basis. The data lake is now a hub for multiple business uses, including the analysis of WebEx user activities.

Right after a talk about data lakes came a topic titled Going Beyond the Data Lake. Vik Kapoor, director of analytics technology architecture and platforms at Pfizer, talked about how they leverage the entire analytics ecosystem. They organized their practice into four steps: find, explore, understand, and share, which walk through data loading, data wrangling, data discovery, and evaluation, and build data products as a result. He also introduced the tools they use for each step.

Coming up next was a panel discussion, in which Scott Saufferer and Robert Warner from ID Analytics answered questions from a host. The director of data operations and the director of engineering took turns telling the audience how they introduced Hadoop into their company, and how both teams collaborate to make the best of it.

Next on the stage was Alex Bates, a soft-spoken CTO from Mtell, talking about hardware: IoT. Mtell builds smart machines with sensors inside that transmit data about the machines' status. The data is collected and processed by apps written in Spark. With the help of machine learning, they learned a lot about the machines and created different agents to monitor anomalies and prevent failures. RESTful APIs were also created so that clients can integrate this with their own monitoring tools.

While data architects and data scientists fight for the driver's seat of big data groups within organizations, it is only fair to invite speakers from both sides to any big data conference. Allen Day, chief scientist at MapR and contributor to many open source projects and machine learning algorithm implementations, did an awesome job explaining how to build a genome analysis pipeline in simple words and diagrams that people with little knowledge of data science could understand. For those who want to dig deeper, he also provided a link to the source code: https://github.com/allenday/spark-genome-alignment-demo.

Last but not least, the energetic Stefan Groschupf, CEO of Datameer, jumped on stage and gave a speech about how to jumpstart a big data project in any organization. As a seasoned entrepreneur, he has a lot of experience running an organization, and his advice was simple and straightforward. Instead of spending a whole lot of money on the latest technology, he suggested forming a small team within the company. The members should be cross-functional, with different types of employees: the visionary, the reality check, the challengers, and the worker bees. The team should focus on problems within the company before bringing up innovative ideas or other people's use cases. As the big data project goes on, the team will find a pain point, identify a few possible solutions, approach it from one small angle, and try out different tools as a proof of concept. A process with a successful result can then be scaled up into a full solution that brings even more value to the company, with the core team members becoming the implementers and managers of the new process.

Believe it or not, this is exactly the kind of idea I approach my clients with as a big data consultant, and I’ve seen them become more and more confident and successful with what they are doing within a couple of years. “Great minds think alike!” That is a great feeling to go home with after a long half-day event.


Panama Papers and Big Data

The "Panama Papers" dropped a bombshell on the world as the largest information leak to date, exposing the hidden treasure of famous figures, including government officials of many countries. Compared to the 1.7 GB WikiLeaks release and the 30 GB Ashley Madison scandal, this leak of 11.5 million documents totaling 2.6 TB definitely earns the name "the leak of the Big Data era"!

It took about 400 journalists two years to dig valuable information out of the vast amount of data leaked from one of the world's largest offshore incorporation firms, Mossack Fonseca. The journalists certainly did not read through all the documents. According to a Forbes report, modern technology was used to parse and index the files for quick searching. The tools are open-source software commonly used in today's big data practice: Apache Tika and Apache Solr.

Apache Tika is used to detect and extract metadata and text from multiple file types including PPT, XLS, and PDF. All files can be parsed through a single interface, making Tika useful for search engine indexing.

Apache Solr is an enterprise search platform that can index text of all types, making full-text search over unstructured data very easy.
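To illustrate the core idea behind such indexing, here is a toy inverted index in Python. This is a conceptual sketch, not how Tika or Solr are actually invoked, and the documents are invented:

```python
from collections import defaultdict

# A full-text index maps each token to the set of documents containing it,
# so a search is a dictionary lookup instead of a scan of every file.
docs = {
    "doc1": "offshore entity registered in Panama",
    "doc2": "invoice for offshore services",
    "doc3": "meeting notes, nothing unusual",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace(",", " ").split():
        index[token].add(doc_id)

def search(term):
    # Return the ids of all documents containing the term.
    return sorted(index.get(term.lower(), set()))
```
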

With this combination of tools, the 2.6 TB of files were properly indexed and stored in the Amazon cloud, providing near-real-time search results. Graph databases and visualization tools were also used to map the complex connections among entities and their relationships. A customized user interface was developed so the journalists could easily navigate the files, find interesting points, connect the dots, and shock the world.

Making data easily accessible is just the first step. Big data is also about applying business knowledge to mining the vast amount of data until a valuable result can be drawn. A full release of the Panama Papers to the public will happen in early May. Big data developers, hone your skills and get prepared.

Big Data Blog

I love to learn about new technology and write about it. It is a great way to keep myself abreast of the latest and greatest in the technology world. I have found that if you have to write about something and post it for the whole world to read, you force yourself to dig deeper and read more, so you can be confident about every word you say about the technology. By doing that, you become a subject matter expert!

I created a blog called Geek at Work in 2010, but after I posted a few entries, I lost my password and never touched the site again. I finally got the site back recently and decided to keep writing. I changed the name of the blog to Big Data of Everything; as you can tell, I will be writing about Big Data: the industry, the tools, the algorithms, the use cases, and how it will change our world.

I will start by collecting some of the articles I wrote about Big Data in recent years.

Stay tuned!



On Friday night, I attended a seminar in which David Chen, Director of Market Data at InterContinental Exchange (ICE), gave an introduction to how stock and futures contract exchanges work. It was the first time I had been exposed to this technology, and it is pretty interesting.

During his talk, he mentioned that Multicast is used to distribute data from the Exchange to multiple client applications.

Multicast addressing is a network technology for delivering information to a group of destinations simultaneously, using the most efficient strategy: messages travel over each link of the network only once, and copies are created only where the links to the multiple destinations split.

The word "multicast" typically refers to IP multicast, which is often employed for media-streaming applications. In IP multicast, the multicast concept is implemented at the IP routing level, where routers create optimal distribution paths (a spanning tree) for datagrams sent to a multicast destination address in real time.

The most common protocol in an IP network is TCP (Transmission Control Protocol), one of the main protocols of the TCP/IP suite. Whereas the IP protocol deals only with packets, TCP enables two hosts to establish a connection and exchange streams of data. TCP guarantees delivery of data, and also guarantees that packets are delivered in the same order in which they were sent.

Since TCP requires a dedicated connection between two endpoints, which creates a lot of overhead in network traffic, it is not a good fit for multicast. Mr. Chen did not say which protocol is used, but I believe it must be UDP.

UDP stands for User Datagram Protocol. It provides a connectionless host-to-host communication path with minimal overhead; each packet on the network consists of a small header plus user data, and is called a UDP datagram.

UDP preserves datagram boundaries between the sender and the receiver: the receiving socket gets one OnDataAvailable event per datagram sent, and each Receive call returns one complete datagram. If the buffer is too small, the datagram is truncated; if the buffer is too large, only one datagram is returned and the remaining buffer space is untouched.

UDP is connectionless, meaning a datagram can be sent at any moment without prior advertising, negotiation, or preparation: just send the datagram and hope the receiver is able to handle it.
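A minimal sketch of this behavior in Python, using two UDP sockets on the loopback interface. The port is chosen by the OS; this is an illustration of datagram boundaries and connectionless sending, not ICE's actual setup:

```python
import socket

# One sendto() produces exactly one datagram, and each recvfrom() returns
# one whole datagram, so message boundaries are preserved (unlike TCP's
# byte stream, where "first" and "second" could arrive glued together).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))   # let the OS pick a free port
receiver.settimeout(5)
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"first", addr)     # no connection setup needed
sender.sendto(b"second", addr)

messages = [receiver.recvfrom(64)[0] for _ in range(2)]
sender.close()
receiver.close()
```

On loopback the two datagrams arrive intact and in order; over a real network, UDP makes no such ordering or delivery guarantee, which is exactly the reliability problem discussed below.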

The biggest disadvantage of UDP is that it is an unreliable protocol: there is absolutely no guarantee that a datagram will be delivered to the destination host. Although the failure rate is very low on the Internet and nearly zero on a LAN unless the bandwidth is saturated, packet loss prevention is crucial when transmitting data from an exchange to client applications. Unfortunately, Mr. Chen did not disclose what method their exchange uses; it must be a core technology that all exchange companies spend money to improve and protect.

Gantt and PERT Charts

Gantt and PERT charts are visualization tools commonly used by project managers to control and administer the tasks required to complete a project.

The Gantt chart, developed by Henry Gantt in 1917, focuses on the sequence of tasks necessary to complete the project at hand. Each task on a Gantt chart is represented as a single horizontal bar on an X-Y chart. The horizontal axis (X-axis) is the time scale over which the project will endure, so the length of each task bar corresponds to the duration of the task, the time necessary for its completion. Arrows connecting tasks reflect the dependencies between the tasks they connect; usually one task cannot begin until another is completed. The resources necessary for completion are also identified next to the chart. The Gantt chart is an excellent tool for quickly assessing the status of a project. The following Gantt chart for developing a proposal was created in MS Project.

[Figure: Gantt chart of the proposal project]

Making this chart is pretty self-explanatory. Almost all controls are available by double-clicking the task names in the column on the left. This chart shows the resources, the completion status (the horizontal black line within each task bar), and the prerequisite relationships, all controllable by double-clicking the appropriate task name on the left. You can change the time scale at the top by right-clicking and choosing the time scale option. It is basically controlled through the typical Microsoft actions used in any MS application.

PERT (Program Evaluation and Review Technique) charts were first developed in the 1950s by the U.S. Navy to help manage very large, complex projects with a high degree of intertask dependency. Classical PERT charting is used to support projects that are often completed using an assembly-line approach. MS Project can create a PERT chart from a Gantt chart. The PERT chart below is another representation of the proposal project shown above.

[Figure: PERT chart of the proposal project]

Again, the representation above is relatively self-explanatory. Completed tasks are crossed out, while partially completed tasks have one slash through them. Each task also shows its duration, start date, and end date.

The critical path (shown in red) is a series of tasks that must be completed on schedule for a project to finish on schedule. Each task on the critical path is a critical task. Most tasks in a typical project have some slack and can therefore be delayed a little without affecting the project finish date. Those tasks that cannot be delayed without affecting the project finish date are the critical tasks. As you modify tasks to resolve overallocations or other problems in your schedule, be aware of the critical tasks and that changes to them will affect your project finish date.
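As a sketch, the earliest-finish computation behind the critical path can be written in a few lines of Python. The task names, durations, and dependencies here are made up for illustration:

```python
# The critical path is the longest-duration chain through the task graph;
# any slip in a task on it delays the whole project. Durations are in days.
duration = {"draft": 3, "review": 2, "budget": 4, "submit": 1}
depends_on = {"draft": [], "budget": [], "review": ["draft"],
              "submit": ["review", "budget"]}

finish = {}  # memoized earliest finish time of each task

def earliest_finish(task):
    # A task starts when its slowest prerequisite finishes.
    if task not in finish:
        start = max((earliest_finish(d) for d in depends_on[task]), default=0)
        finish[task] = start + duration[task]
    return finish[task]

# The project length is the latest finish time over all tasks.
project_length = max(earliest_finish(t) for t in duration)
```

Here the chain draft → review → submit takes 3 + 2 + 1 = 6 days, longer than budget → submit at 4 + 1 = 5, so draft, review, and submit are the critical tasks: delaying any of them delays the project, while budget has a day of slack.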