Alan R. Simon

Data Lakes For Dummies




Data lakes mostly use a technique called ELT, which stands for either extract, load, and transform or extraction, loading, and transformation. With ELT, you “blast” your data into a data lake without having to spend a great deal of time profiling and understanding the particulars of your data. You extract data (the E part of ELT) from its original home in a source application, and then, after that data has been transmitted to the data lake, you load the data (the L) into its initial storage location. Eventually, when it’s time for you to use the data for analytical purposes, you’ll need to transform the data (the T) into whatever format is needed for a specific type of analysis.
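The E-L-T ordering can be sketched in a few lines of Python. The function names, file paths, and sample records here are all illustrative, not from the book; the point is only the order of operations: extract, load the raw data, and transform last, when analysis demands it.

```python
import json
import pathlib

# Hypothetical ELT sketch -- names, paths, and records are illustrative only.

def extract(source_records):
    """E: pull records from a source application exactly as they are."""
    return list(source_records)

def load_raw(records, lake_dir):
    """L: land the untouched records in the data lake's initial storage."""
    lake_dir = pathlib.Path(lake_dir)
    lake_dir.mkdir(parents=True, exist_ok=True)
    raw_file = lake_dir / "orders_raw.json"
    raw_file.write_text(json.dumps(records))
    return raw_file

def transform_for_analysis(raw_file):
    """T: reshape the raw data later, only when an analysis needs it."""
    records = json.loads(pathlib.Path(raw_file).read_text())
    return [{"id": r["id"], "amount": float(r["amount"])} for r in records]

source = [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]
raw = load_raw(extract(source), "lake/bronze")
print(transform_for_analysis(raw))  # transformation happens last, not before loading
```

Contrast this with ETL, where `transform_for_analysis` would have to run before `load_raw`, forcing you to understand the data before it ever reaches the lake.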

For data warehousing — the predecessor to data lakes that you’re almost certainly still also using — data is copied from source applications to the data warehouse using a technique called ETL, rather than ELT. With ETL, you need to thoroughly understand the particulars of your data on its way into the data warehouse, which requires the transformation (T) to occur before the data is loaded (L) into its usable form.

      With ELT, you can control the latency, or “freshness,” of data that is brought into the data lake. Some data needed for critical, real-time analysis can be streamed into the data lake, which means that a copy is sent to the data lake immediately after data is created or updated within a source application. (This is referred to as a low-latency data feed.) You essentially push data into your data lake piece by piece immediately upon the creation of that data.

      Other data may be less time-critical and can be “batched up” in a source application and then periodically transmitted in bulk to the data lake.

      You can specify the latency requirements for every single data feed from every single source application.
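One way to picture those per-feed latency settings is a simple list of feed definitions, as in this hypothetical Python sketch (the feed names, field names, and intervals are made up for illustration):

```python
# Hypothetical per-feed latency settings -- every source gets its own entry.
feeds = [
    {"source": "point_of_sale", "mode": "streaming"},                 # low-latency: push each piece of data as it's created
    {"source": "hr_system",     "mode": "batch", "every_hours": 24},  # batched up, transmitted in bulk once a day
    {"source": "web_clicks",    "mode": "batch", "every_hours": 1},   # batched, but refreshed hourly
]

def is_low_latency(feed):
    """A streaming feed sends data to the lake immediately upon creation."""
    return feed["mode"] == "streaming"

for feed in feeds:
    label = "stream immediately" if is_low_latency(feed) else f"batch every {feed['every_hours']}h"
    print(f"{feed['source']}: {label}")
```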

      

The ELT model also allows you to identify a new source of data for your data lake and then very quickly bring in the data that you need. You don’t need to spend days or weeks dissecting the ins and outs of the new data source to understand its structure and business rules. You “blast” the data into your data lake in the natural form of the data: database tables, MP4 files, or however the data is stored. Then, when it’s time to use that data for analysis, you can proceed to dig into the particulars and get the data ready for reports, machine learning, or however you’re going to be using and analyzing the data.

      Everyone visits the data lake

      Take a look around your organization today. Chances are, you have dozens or even hundreds of different places to go for reports and analytics. At one time, your company probably had the idea of building an enterprise data warehouse that would provide data for almost all the analytical needs across the entire company. Alas, for many reasons, you instead wound up with numerous data marts and other environments, very few of which work together. Even enterprise data warehouses are often accompanied by an entire portfolio of data marts in the typical organization.

      Great news! The data lake will finally be that one-stop shopping place for the data to meet almost all the analytical needs across your entire enterprise.

      Enterprise-scale data warehousing fell short for many different reasons, including the underlying technology platforms. Data lakes overcome those shortfalls and provide the foundation for an entirely new generation of integrated, enterprise-wide analytics.

Even with a data lake, you’ll almost certainly still have other data environments outside the data lake that support analytics. Your data lake objective should be to satisfy almost all your organization’s analytical needs and be the go-to place for data. If a few other environments pop up here and there, that’s okay. Just be careful about the overall proliferation of systems outside your data lake; otherwise, you’ll wind up right back in the same highly fragmented data mess that you had before beginning work on your data lake.

      Suppose you head off for a weeklong vacation to your favorite lake resort. The people who run the resort have divided the lake into different zones, each for a different recreational purpose. One zone is set aside for water-skiing; a second zone is for speedboats, but no water-skiing is permitted in that zone; a third zone is only for boats without motors; and a fourth zone allows only swimming but no water vessels at all.

      The operators of the resort could’ve said, “What the heck, let’s just have a free-for-all out on the lake and hope for the best.” Instead, they wisely established different zones for different purposes, resulting in orderly, peaceful vacations (hopefully!) rather than chaos.

      A data lake is also divided into different zones. The exact number of zones may vary from one organization’s data lake to another’s, but you’ll always find at least three zones in use — bronze, silver, and gold — and sometimes a fourth zone, the sandbox.

Recommended Zone Name   Other Names
Bronze zone             Raw zone, landing zone
Silver zone             Cleansed zone, refined zone
Gold zone               Performance zone, curated zone, data model zone
Sandbox                 Experimental zone, short-term analytics zone
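As a rough sketch of how these zones might map onto storage, here is a hypothetical Python helper that builds zone paths. The folder names and layout are assumptions for illustration, not a prescribed standard:

```python
import pathlib

# Illustrative zone layout -- folder names and descriptions are assumptions.
ZONES = {
    "bronze": "raw data, exactly as extracted from the source",
    "silver": "data that has been error-checked and cleansed",
    "gold": "curated data, modeled for reports and analysis",
    "sandbox": "short-term experimental work",
}

def zone_path(lake_root, zone, dataset):
    """Build the storage path for a dataset within one zone of the lake."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return pathlib.Path(lake_root) / zone / dataset

print(zone_path("datalake", "bronze", "orders"))
```

Keeping the zone names in one place, as `ZONES` does here, is one way to enforce the "different zones for different purposes" rule: a feed that names an undefined zone fails immediately instead of creating a free-for-all.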

      

The boundaries and borders between your data lake zones can be fluid (Fluid? Get it?), especially with streaming data, as I explain in Part 2.

      The bronze zone

      You load your data into the bronze zone when the data first enters the data lake. First, you extract the data from a source application (the E part of ELT), and then the data is transmitted into the bronze zone in raw form (thus, one of the alternative names for this zone). You don’t correct any errors or otherwise transform or modify the data at all. The original operational data should look identical to the copy of that data now in the bronze zone.

      

Your catchphrase for loading data into the bronze zone is “the need for speed.” You may be trickling one piece of data at a time or bulk-loading hundreds of gigabytes or even terabytes of data. Your objective is to transmit the data into the data lake environment as quickly as possible. You’ll worry about checking out and refining that data later.
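A minimal sketch of that "need for speed" loading, assuming a simple file-based lake (all names here are hypothetical): the loader copies source bytes untouched, deferring any checking or refinement.

```python
import pathlib
import shutil
import tempfile

# Hypothetical bronze-zone loader: copy source files byte-for-byte,
# with no inspection and no error correction -- "the need for speed."
def land_in_bronze(source_file, bronze_dir):
    """Copy a source file into the bronze zone exactly as-is."""
    bronze_dir = pathlib.Path(bronze_dir)
    bronze_dir.mkdir(parents=True, exist_ok=True)
    target = bronze_dir / pathlib.Path(source_file).name
    shutil.copyfile(source_file, target)  # raw copy only; refinement waits for the silver zone
    return target

# The landed copy is identical to the source -- errors and all.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "orders.csv"
    src.write_text("id,amount\n1,19.99\n2,oops\n")  # the bad value travels unchanged
    landed = land_in_bronze(src, pathlib.Path(tmp) / "bronze")
    assert landed.read_text() == src.read_text()
```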

      The silver zone

      The silver zone consists of data that has been error-checked and cleansed