September 21, 2017

The data lake revolution - what it offers to your organization

Tuukka Arola

Lead Big Data Architect, Tieto

Talk to IT pros today and one term you're likely to hear quickly is "data lake". The amount of information available to businesses has grown hugely in recent years, and organisations need efficient ways of managing and analysing it. Data lakes are increasingly recognised as the answer.

But while people everywhere may be talking about data lakes, there's still a relative lack of understanding about what they actually are, how they differ from other technologies, and what businesses need to do to make the most of them.

So in my next few blogs, I thought I'd take some time to break down what can be a highly complex subject and answer some of the key questions I frequently hear about this technology.

Later on, I'll go over some key issues, such as the security implications of a data lake strategy and what organisations need to do to implement one effectively from an end-user perspective. For now, let's focus on the basics of what a data lake actually is.

A single source of insight

At its most basic level, a data lake offers businesses a single repository where they can collate and store very large quantities of data from various sources and in a wide range of different formats.

Unlike many traditional database solutions, data lakes aren't limited to certain formats of data. They could, for example, contain image and audio files, text-heavy PDFs and unstructured data collected by Internet of Things sensors.

How they differ from data warehouses

One of the most common questions about data lakes is how they differ from the traditional data warehousing solutions that have been around for many years. While there is a range of differentiating factors, data warehouses tend to be more rigidly defined. Inputs are limited to specific types of structured data, on which users build an overarching analytical model that can be shared throughout the enterprise.

While such solutions can be very powerful, they are rather limited in scope compared with what a data lake can offer. For example, within a data warehouse model, it may be very difficult to change certain parameters for an analysis without causing knock-on effects that break the model.

On the other hand, data lakes give users the agility they need to create more ad-hoc analytics solutions that are more focused on specific use cases, rather than an all-encompassing analytical model. So if a business needs to find an answer to a particular query, it can build a new solution to do this very quickly.

A more programmable solution

To put it another way, a data lake is not so much a warehousing platform as it is a coding platform. Under a traditional data warehousing system, a company would usually draw out insight by having a business intelligence professional write an SQL query against the database. But while a data warehouse is well suited to providing answers to such queries, a data lake can do so much more.

A data lake enables users to create full programs to analyse structured and unstructured data, taking advantage of modern languages such as Scala, Python and R. This makes it possible for businesses to draw on the libraries that have been created for these languages. For example, there are some very useful statistical libraries available for R and Python that data lake users can leverage, while also turning to the likes of Scala or Java for intensive data processing requirements.
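To make the idea of analysing loosely structured data with a full program a little more concrete, here is a minimal sketch in Python. It is purely illustrative: the sensor records, the field names (`sensor`, `temp_c`, `note`) and the function `mean_temp_by_sensor` are all hypothetical, and the data is held in a string rather than read from a real data lake. It uses only Python's standard `json` and `statistics` modules.

```python
import json
import statistics

# Hypothetical raw sensor records, as they might sit in a data lake as
# newline-delimited JSON. The fields are deliberately not uniform.
RAW_RECORDS = """\
{"sensor": "s1", "temp_c": 21.5}
{"sensor": "s1", "temp_c": 22.1, "note": "recalibrated"}
{"sensor": "s2", "temp_c": 19.8}
{"sensor": "s2"}
{"sensor": "s2", "temp_c": 20.4}
"""


def mean_temp_by_sensor(raw_text):
    """Parse JSON lines and return the mean temperature per sensor."""
    readings = {}
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        # Skip records that carry no reading instead of failing on them.
        if "temp_c" not in record:
            continue
        readings.setdefault(record["sensor"], []).append(record["temp_c"])
    return {
        sensor: statistics.mean(values)
        for sensor, values in readings.items()
    }


summary = mean_temp_by_sensor(RAW_RECORDS)
print(summary)
```

The point of the sketch is that the program decides how to interpret each record at read time, so records with extra or missing fields do not need to fit a predefined table before they can be analysed.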

This can make a huge difference to the quality of insight businesses are able to get from their data. But of course, with this increase in capabilities, organisations also need to look at the skills they have at their disposal.

A business intelligence pro may be able to run SQL queries on a data lake as easily as they would on a data warehouse, but they may find it difficult to bring in advanced programming until they have experience in the relevant languages. This is where data scientists come in, and these personnel are likely to be in very high demand in the coming years as more firms recognise the potential of the data lake.

What this means is that when it comes to conducting advanced data exploration activities, a data lake is by far the best solution. But of course, moving to such technology brings with it its own set of challenges, such as security, which I'll cover in my next blog.
