Building data platforms and deliverying advanced analytical services in the new age of data intelligence can be a daunting task. It’s not really helping with all the tools and methodologies that we know we can use. Therefore, a reference architecture is needed to provide guidelines for the process design and best practices for advanced analytics, so we can not only meet the business requirement, but also bring more value to the business.
1. Architectural Guidance
- The architecture should cover all building blocks including the following: Data Infrastructure, Data Engineering, Traditional Business Intelligence, and Advanced Analytics. Within Advanced Analytics, we should include machine learning, deep learning, data science, predictive analytics, and the operationalization of models.
- One of the first steps should be finding the gaps between current infrastructure, tools, technologies and the end state environment.
- We need to create a unified approach to both structured and unstructured data. It’s perfectly fine to maintain two different environments for structured and unstructured data, although both systems will look more and more close to each other.
- Rome is not built in one night. We need to first build a road map, with budget in mind, on how the organization can get to the end state, adept and/or pivot whenever needed along the way.
2. Best Practices
- There is never one best solution for all. A different scenario will have its very own best approach. However, we can create standard approaches for different categories. Creating best practices for different categories or industries and make them options, it is by itself a best practice.
- Things we need to consider when suggesting a best practice includes company size, current infrastructure, skillsets of existing IT personnel.
3. Framework for Solutions.
- A reference architecture for Advanced Analytics is depicted in the following diagram. On the bottom of the picture are the data sources, divided into structured and unstructured categories. Structured data are mostly operational data from existing ERP, CRM, Accounting, and any other systems that create the transactions for the business. They are handled by relational databases RDBMS such as Oracle, Teradata, and MS SqlServer. The RDBMS can be used as the backend for applications which produce these transactions, and they are called OLTP – online transactional processing system. Periodically the transactional data will be copied over to data stores for analytical and reporting purpose. These data stores are also built on RDBMS, and they are called OLAP – online analytcal processing sytem. On top of data warehouse is business intelligence and data visualization. We have quite a few powerful tools to support this capability.
- On the right side, where the unstructured data is processed, that’s the big data world. Just as for structured data, there is a variety of tools that we can use for ETL (Extract-Transform-Load) of the data into selected data platforms, which include Hadoop, NoSQL, and all those cloud based storage systems. Data is ingested into these filesystem based data stores, and is then processed by multiple analytical tools. The analytical results are either fed to the data visualization tools, or operationalized by APIs created using all kinds of technology.
- Demanding for streaming process is also growing tremendously, which requires real-time or near real-time analytics of vast amount of data to identify treads, find anomalies, and predict results. A few tools that can be used in this category is recommended.
4. The tools I recommend for multiple data processing purposes are listed as follows. (So they can be search-engine-friendly, even though they are all listed in the picture above.)
- Data Ingestion (Paxata, Pentaho, Talend, informatica…)
- Data Storage (Cloudera, Hortonworks, MapR Hadoop, Cassandra, HBase, MongoDB, S3, Google CloudPlatform…)
- Data Analytics (Python, R, H2O (ML and DL), TensorFlow (GPU optional), Databricks/Spark, Gaffe (GPU), Torch (GPU) for for deep learning of image and sound)
- Data Visualization (Tableau, Qlikview, Cisco DV…)