It seems that although several vendors are marketing data lakes to capitalise on big data opportunities, there is little understanding about what comprises a data lake, or how to get value from it.
“In broad terms, data lakes are marketed as enterprisewide data management platforms for analysing disparate sources of data in its native format,” said Nick Heudecker, research director at Gartner. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organisation.”
The data lake concept aims to solve two problems. The first is information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and licence reduction.
At the same time, data lakes tackle big data initiatives. Big data projects require a large amount of varied information. The information is so varied that it’s not clear what it is when it is received, and confining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
“Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used: data is simply dumped into the data lake,” said Andrew White, vice president and analyst at Gartner. “However, getting value out of the data remains the responsibility of the business end user. Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place.”
Data lakes therefore carry substantial risks. The most important is the inability to determine data quality, or to trace the lineage of findings made by other analysts or users who have found value in the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. And without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp.
Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.
“The fundamental issue with the data lake, however, is that it makes certain assumptions about the users of information,” explained Heudecker. “It assumes that users recognise or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources and that they understand the incomplete nature of datasets, regardless of structure.”
While these assumptions may be true for users who work with data routinely, such as data scientists, the majority of business users lack this level of sophistication or support from operational information governance routines. Developing or acquiring these skills, or obtaining such support on an individual basis, is both time-consuming and expensive, or impossible.
“There is always value to be found in data, but the question your organisation has to address is this: do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalise to a degree that effort, and try to sustain the value-generating skills we develop?” said White. “If the option is the former, it is quite likely that a data lake will appeal. If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy.”