Decoder: Data lake

A data lake is a repository — typically a large one — for storing data of many types.

Data lakes are systems that store vast quantities of data. Typically, they’re built with the aim of improving corporate decision making.

Data lakes are more flexible and faster than traditional data warehouses.

What is it?

A large repository of data of all types. Before building one, make sure you have a plan to deliver value from it.

What’s in it for you?

Data lakes were intended as repositories for vast quantities of data that could be analyzed faster than in traditional data warehouses, producing insights closer to real time and leaving more flexibility for new types of analysis.

What are the trade-offs?

Many enterprises failed to generate a return on their investment because they had quality issues with the data in their lakes or had invested significant sums in creating their lakes before identifying use cases.

How is it being used?

Data lakes are frequently used to store and process big data.

What is it?

A data lake is a repository for storing large quantities of raw data. That data could come from all corners of the enterprise, ranging from structured data in the operational and transactional systems that run the business to unstructured external data on things like customer preferences.

Data lakes were initially seen as an improvement on traditional data warehouses, which typically needed data to be processed before it was stored, and where new types of analysis were slow because they required building and feeding new data into the warehouse.

Data lakes solved those problems by stressing the need to capture data first, in its raw state, and analyze it later.
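The "capture first, analyze later" idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the in-memory list standing in for the lake, the event shapes, and the function names `capture` and `read_orders` are hypothetical and not part of any particular data lake product.

```python
import json

# Hypothetical raw events from different source systems; the shapes differ on purpose.
raw_events = [
    '{"source": "orders", "order_id": 1, "amount": "19.99"}',
    '{"source": "web", "page": "/home", "customer": "c-42"}',
    '{"source": "orders", "order_id": 2, "amount": "5.00"}',
]


def capture(lake, raw_line):
    """Store the record untouched; no schema is enforced at write time."""
    lake.append(raw_line)


def read_orders(lake):
    """Apply a schema only when the data is read for a specific analysis."""
    orders = []
    for line in lake:
        record = json.loads(line)
        if record.get("source") == "orders":
            orders.append({
                "order_id": int(record["order_id"]),
                "amount": float(record["amount"]),
            })
    return orders


# An in-memory list stands in for the lake's storage (an assumption of this sketch).
lake = []
for event in raw_events:
    capture(lake, event)

orders = read_orders(lake)
total = sum(order["amount"] for order in orders)
```

The point of the design is that a new analysis only needs a new reader function; nothing already stored has to be reshaped or reloaded.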

Unfortunately, while data lakes solved some of the problems with data warehouses, they still didn’t solve the most critical one: extracting value from the data.

Capturing data and storing it in a lake doesn’t in itself address the challenge of getting value from that data. Many organizations have been disappointed with their data lakes because of data quality issues: with no curation of the data going into the lake, duplicated and otherwise poor-quality records accumulate.
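As a rough illustration of what curation at the point of entry might look like, the sketch below screens incoming records for duplicates and missing fields before they are stored. The function name `curate`, the field names, and the sample records are all hypothetical; real checks would depend on your own data and tooling.

```python
def curate(records, key_field, required_fields):
    """Split records into accepted and rejected before they enter the lake."""
    seen = set()
    accepted, rejected = [], []
    for record in records:
        # Treat absent or empty values as missing.
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        key = record.get(key_field)
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif key in seen:
            rejected.append((record, "duplicate"))
        else:
            seen.add(key)
            accepted.append(record)
    return accepted, rejected


# Hypothetical incoming customer records.
incoming = [
    {"customer_id": "c-1", "email": "a@example.com"},
    {"customer_id": "c-1", "email": "a@example.com"},  # duplicate key
    {"customer_id": "c-2", "email": ""},               # missing email
    {"customer_id": "c-3", "email": "c@example.com"},
]

accepted, rejected = curate(incoming, "customer_id", ["customer_id", "email"])
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it possible to trace quality problems back to the source system.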

What’s in it for you?

Data lakes are more flexible and faster than traditional data warehouses. Done well, data lakes provide a way of storing big data, which can then be analyzed, enabling you to gain new insights, perhaps into business performance or new customer trends.

Data lakes have also been useful in enabling companies to add large public data sets to their analyses, for example using weather data to see the impact of good weather on a retail business, or mapping data to optimize transportation routes for a supply chain.

What are the trade-offs?

There is an old rule of thumb that says, “data that isn’t used will go bad, just like ripe bananas.” Whether you are building a data warehouse, a data lake, or a data mesh, building them without identifying how the data will be used is risky. If the source data is a mess when fed into a data lake, it will still be a mess when you try to work with it.

Done right, with appropriate emphasis on data use, a data lake can be a useful technology in your data plans.

Many organizations have been disappointed with their data lake investments because they didn’t plan upfront how the data in their lake would be used. If you identify a valuable use case first, you’ll find the investment in building a data lake generates a return sooner.