With ELT, you can control the latency, or “freshness,” of the data brought into the data lake. Data needed for critical, real-time analysis can be streamed into the data lake, meaning that a copy is sent to the lake immediately after the data is created or updated within a source application. (This is referred to as a low-latency data feed.) In effect, you push data into your data lake piece by piece, as soon as that data comes into existence.
Other data may be less time-critical and can be “batched up” in a source application and then periodically transmitted in bulk to the data lake.
You can specify the latency requirements for every single data feed from every single source application.
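To make this concrete, here's a minimal sketch of how per-feed latency requirements might be declared in an ingestion layer. The feed names, modes, and fields are hypothetical illustrations, not any real product's API:

```python
# Hypothetical per-feed latency configuration; names and fields are
# illustrative, not a real ingestion product's API.

FEEDS = [
    {"source": "orders_app", "mode": "stream", "max_latency_seconds": 5},
    {"source": "hr_system",  "mode": "batch",  "schedule": "daily @ 02:00"},
    {"source": "web_clicks", "mode": "stream", "max_latency_seconds": 1},
]

def describe(feed: dict) -> str:
    """Summarize one feed's freshness contract in plain language."""
    if feed["mode"] == "stream":
        return f"{feed['source']}: streamed, lands within {feed['max_latency_seconds']}s"
    return f"{feed['source']}: batched, loaded {feed['schedule']}"

for feed in FEEDS:
    print(describe(feed))
```

The point of a declaration like this is that freshness becomes a per-feed contract: each source application gets exactly the latency its analytics require, rather than one setting for the whole lake.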
Everyone visits the data lake
Take a look around your organization today. Chances are, you have dozens or even hundreds of different places to go for reports and analytics. At one time, your company probably had the idea of building an enterprise data warehouse that would provide data for almost all the analytical needs across the entire company. Alas, for many reasons, you instead wound up with numerous data marts and other environments, very few of which work together. In the typical organization, even an enterprise data warehouse is accompanied by an entire portfolio of data marts.
Great news! The data lake will finally be that one-stop shop for the data that meets almost all the analytical needs across your entire enterprise.
Enterprise-scale data warehousing fell short for many reasons, including the limitations of its underlying technology platforms. Data lakes overcome those shortfalls and provide the foundation for an entirely new generation of integrated, enterprise-wide analytics.
The Data Lake Olympics
Suppose you head off for a weeklong vacation to your favorite lake resort. The people who run the resort have divided the lake into different zones, each for a different recreational purpose. One zone is set aside for water-skiing; a second zone is for speedboats, but no water-skiing is permitted in that zone; a third zone is only for boats without motors; and a fourth zone allows only swimming but no water vessels at all.
The operators of the resort could’ve said, “What the heck, let’s just have a free-for-all out on the lake and hope for the best.” Instead, they wisely established different zones for different purposes, resulting in orderly, peaceful vacations (hopefully!) rather than chaos.
A data lake is also divided into different zones. The exact number of zones may vary from one organization’s data lake to another’s, but you’ll always find at least three zones in use — bronze, silver, and gold — and sometimes a fourth zone, the sandbox.
Bronze, silver, and gold aren’t “official” standardized names, but they are catchy and easy to remember. Other names that you may find are shown in Table 1-1.
TABLE 1-1 Data Lake Zones
| Recommended Zone Name | Other Names |
| --- | --- |
| Bronze zone | Raw zone, landing zone |
| Silver zone | Cleansed zone, refined zone |
| Gold zone | Performance zone, curated zone, data model zone |
| Sandbox | Experimental zone, short-term analytics zone |
All the data lake zones, including the sandbox, are discussed in more detail in Part 2, but the following sections provide a brief overview.
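As a rough illustration of how these zones often show up in practice, here's a sketch that maps each zone to a storage prefix, assuming an object store such as Amazon S3. The bucket name and layout are purely illustrative assumptions:

```python
# Illustrative mapping of data lake zones to object-store prefixes.
# The bucket name and layout are assumptions, not a standard.

LAKE_BUCKET = "s3://example-corp-data-lake"

ZONES = {
    "bronze":  f"{LAKE_BUCKET}/bronze",   # raw data, exactly as extracted
    "silver":  f"{LAKE_BUCKET}/silver",   # cleansed, error-checked data
    "gold":    f"{LAKE_BUCKET}/gold",     # curated, analytics-ready data
    "sandbox": f"{LAKE_BUCKET}/sandbox",  # short-term experiments
}

def zone_path(zone: str, dataset: str) -> str:
    """Build the storage path for a dataset within a given zone."""
    return f"{ZONES[zone]}/{dataset}"

print(zone_path("bronze", "orders/2024-06-01"))
```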
The bronze zone
You load your data into the bronze zone when the data first enters the data lake. First, you extract the data from a source application (the E in ELT); then you load it into the bronze zone in raw form (the L, and the source of one of this zone's alternative names). You don't correct any errors or otherwise transform or modify the data at all. The copy now sitting in the bronze zone should look identical to the original operational data.
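Here's a minimal sketch of that idea: a bronze-zone load that copies an already-extracted file byte for byte and then verifies that nothing changed in transit. The paths and the checksum verification are illustrative assumptions, not a prescribed method:

```python
# Minimal sketch of a bronze-zone load, assuming the extract step has
# already produced a local file. Paths and the checksum verification
# are illustrative assumptions, not a prescribed method.

import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file so the bronze copy can be compared to the extract."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def load_to_bronze(extracted_file: Path, bronze_dir: Path) -> Path:
    """Copy extracted data into the bronze zone byte for byte: no cleansing,
    no transformation, so the copy stays identical to the source extract."""
    bronze_dir.mkdir(parents=True, exist_ok=True)
    target = bronze_dir / extracted_file.name
    shutil.copy2(extracted_file, target)
    if sha256(extracted_file) != sha256(target):
        raise IOError("bronze copy does not match the extracted source")
    return target
```

The checksum step simply enforces the rule stated above: whatever lands in the bronze zone must be identical to what came out of the source application.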
The silver zone
The silver zone consists of data that has been error-checked and cleansed