May 31, 2018 | By Shishir Shrivastava
Enterprises have been mining data for centuries, starting with handwritten book keeping ledgers, but data management really took off with the advent of the database management systems (DBMS) in the late 1970s. Corporations started stockpiling loads of data into the enterprise application system, and database software became the backbone. Enter vendors like Oracle, IBM, DB2, Teradata, MS SQL Server.
The operational systems grew rapidly, along with the data gathered. Companies wanted to use the data to help executives to make decisions quickly and effectively, but the systems weren’t efficient in providing analytical reports fast enough to make a real difference.
In the late 1980s, data warehousing emerged, brewing along with online transaction processing systems. Enterprises storing the data in relational databases were eager to use it for decision making. Top management in big enterprises wanted to jump on the bandwagon. There were many key factors driving the enterprises towards data warehousing, including:
The data warehouse outline
This paradigm of concocting data in enterprises brought a new field of study, data warehousing. The essence of this framework is pulling the data from source systems and offloading into another relational database called a data warehouse.
Diagram: Conventional data warehouse architecture
In this model, data is absorbed from various relational operational database systems and placed in a staging area. Once the data is brought in from the staging area, it’s processed, refined and transformed to suit the reporting needs. This cleansed data is then deployed in different data marts in order to streamline analytical reporting.
Custom data warehouses
This is a typical data warehousing architectural setup in most enterprises, which process their internally generated data via operational data stores. Since every industry, from finance to energy, has unique needs, the processing frameworks and paradigms are also unique. One product or solution that fits all sizes isn’t possible. Custom data warehouse provided that much-needed analytical solution not just to top executives but also the department/cost-center heads. Decision-making became faster due to readily available preprocessed data from the operational systems. This technique not only resolved the complexities involved in providing analytical reports but also reduced the duration of delivery. The precooked data significantly reduced design and development efforts of analytical reports. This also settled the issues related to resource intensive analytical reports emanating out of operational data stores. Report rendering is now rapid.
As enterprises have now started demanding more from data, new fields like predictive analytics and business intelligence have started to germinate. Smaller enterprises have also begun investing in building their own custom data warehouses, but early adoption has given many advantages to the enterprises.
Prebuilt data warehouses
Amidst growth in database software, enterprise applications development were also maturing, and new ERP software came along: JD Edwards, SAP, Oracle Applications, PeopleSoft, Siebel and more. Meanwhile, building enterprise specific data warehousing solution had its own challenges:
Since ERPs have definitive processes and standards, it became popular to design prebuilt data warehouse framework on top of the standard ERP applications instead of designing custom warehouse for individual enterprises. These prebuilt warehouses have become favored choices because of low cost of ownership, faster deployment and little need for customization.
As companies become more customer-centric, they seek to better understand customers create better user experience. This has necessitated capturing data from outside the organization through various channels, like email, web, text messages, social media, audio, video. And it’s a huge volume of data that’s not just structured in relational formats, but also unstructured. This is putting lots of pressure on the data warehousing model, both in storage and analysis needs.
Processing varied forms of data and in humongous size has uncovered limitations in traditional data warehousing:
These limitations on data warehouses compelled the industry to look for a better solution. With the onset of the technology revolution around massive parallel processing (MPP), it’s becoming faster to crunch enormous data. Unlike conventional data warehousing, the approach towards capturing data has changed in modern day data analytics. Organizations now focus more on mining data using parallel processing techniques in its native form and using state-of-the-art open source tools to pull the data and provide analytical reporting. This whole setup is popularly known as “data lake,” a term coined by James Dixon, the founder of Pentaho BI tools.
Diagram: Data lake architecture overview
The benefits of data lakes are enormous:
Open source technology has given huge impetus to the research and innovation in this space. There are abundant open source tools available in the market to perform analysis of data on a large scale.
Some tools for managing the data lake ecosystem:
I have firsthand experience working closely with a manufacturing customer who is implementing a data lake to streamline data management. It’s a large organization with multiple enterprise applications, each serving respective business units, and with multiple locations across the globe. They had isolated data warehouses for each enterprise application to get the analytical reports for business insights. Clearly, the holistic picture of data and direction for the entire organization was missing. They revamped the data strategy and implemented a data lake to enable unified enterprise-level BI reporting. This also made it possible to pull external non-organizational data to get a better perception of customer sentiment.
Happy with the success of the data lake investment, the customer’s IT organization is exploring various options to streamline real-time reporting.
Our engineering team at TEKsystems’ research and innovation lab has been working with variety of tools and putting together a state-of-the-art, low total cost-of-ownership and modern data lake architecture. As I write here, there are many accelerators and smart Tools being designed for rapid deployment of data lakes. Following the Bell curve, after a traditional custom data warehouse, we’ve seen the dawn of the prebuilt data warehouse. This is the age of custom data lakes … let’s wait and watch the prebuilt data lake story unfold!