The “Panama Papers” dropped a bombshell on the world as the largest information leak to date, exposing the hidden wealth of famous figures, including government officials of many countries. Compared with the 1.7 GB WikiLeaks release and the 30 GB Ashley Madison data scandal, this leak of 11.5 million documents totaling 2.6 TB has definitely earned the name “Leak of the Big Data era”!
It took about 400 journalists from around the world, working with the German newspaper Süddeutsche Zeitung, about a year to dig valuable information out of the vast amount of data leaked from Mossack Fonseca, one of the world’s largest firms handling the incorporation of offshore entities. The journalists definitely did not read through all the documents. According to a Forbes report, current technology was used to parse and index these files so they could be searched quickly. The tools are open-source software commonly used in today’s big data practice: Apache Tika and Apache Solr.
Apache Tika is used to detect and extract metadata and text from multiple file types including PPT, XLS, and PDF. All files can be parsed through a single interface, making Tika useful for search engine indexing.
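To make the "single interface" idea concrete, here is a minimal sketch in Python that talks to Tika Server over HTTP. It assumes a Tika Server instance is running locally on its default port 9998; the helper names `build_tika_request` and `extract` are made up for this illustration, and nothing here is meant to describe how the journalists' actual pipeline was written.

```python
import urllib.request

# Assumption: a Tika Server instance is listening locally on its default port.
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(data: bytes, endpoint: str = "tika") -> urllib.request.Request:
    """Build (but do not send) a PUT request for Tika Server.

    endpoint "tika" asks for the extracted text, "meta" asks for the metadata.
    The same call works whatever the file type is; Tika detects it.
    """
    if endpoint not in ("tika", "meta"):
        raise ValueError("endpoint must be 'tika' or 'meta'")
    accept = "text/plain" if endpoint == "tika" else "application/json"
    return urllib.request.Request(
        url=f"{TIKA_SERVER}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract(data: bytes, endpoint: str = "tika") -> str:
    """Send the document bytes to Tika Server and return the response body."""
    with urllib.request.urlopen(build_tika_request(data, endpoint)) as resp:
        return resp.read().decode("utf-8")
```

With a server running, `extract(open("report.pdf", "rb").read())` would return the plain text of a (hypothetical) `report.pdf`, and passing `"meta"` as the second argument would return its metadata as JSON.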
Apache Solr is an enterprise search platform that can index text from all types of documents, making full-text search on unstructured data very easy.
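As a rough sketch of what indexing and searching with Solr looks like over its HTTP API, the Python snippet below builds an update request and a search URL. The collection name `papers` and the field names `id` and `content` are assumptions chosen for the example, as are the helper function names; a real deployment would define its own schema.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: a local Solr instance and a collection named "papers";
# the field names "id" and "content" are illustrative only.
SOLR_BASE = "http://localhost:8983/solr/papers"


def build_update_request(docs: list) -> urllib.request.Request:
    """Build a request that posts documents as a JSON array to /update."""
    return urllib.request.Request(
        url=f"{SOLR_BASE}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_search_url(text: str, rows: int = 10) -> str:
    """Build a /select URL that searches the "content" field for a phrase."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'content:"{escaped}"', "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urllib.parse.urlencode(params)}"
```

Passing the update request to `urllib.request.urlopen` would index the documents, for example text that Tika extracted, and fetching the search URL would return the matching documents as JSON.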
With these tools combined, the 2.6 TB of files were indexed and stored in the Amazon cloud, providing near real-time search results. In addition, a graph database and visualization tools were used to map the complex connections among entities and their relationships. A customized user interface was developed so the journalists could easily navigate the files, find interesting points, connect the dots, and shock the world.
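The "connect the dots" step is essentially a graph query. The toy Python sketch below stands in for a graph database with a plain in-memory adjacency map and uses breadth-first search to find the shortest chain linking two entities. The entity names are invented for illustration, and a real graph database would use its own query language rather than this code.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (entity_a, entity_b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of entities, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back through the predecessors to rebuild the chain.
                path = [neighbor]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical relationships: who is linked to which shell company or agent.
edges = [
    ("Person A", "Shell Co 1"),
    ("Shell Co 1", "Agent X"),
    ("Agent X", "Shell Co 2"),
    ("Person B", "Shell Co 2"),
]
graph = build_graph(edges)
chain = shortest_path(graph, "Person A", "Person B")
```

Here `chain` lists every intermediary between the two people, which is the kind of link a visualization tool would then draw.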
Making data easily accessible is just the first step. Big data is also about applying business knowledge to mine the vast amount of data until a valuable result can be drawn. A full release of the Panama Papers to the public is expected in early May. Big data developers, hone your skills and get prepared.