Alan R. Simon

Data Lakes For Dummies



application in JavaScript Object Notation (JSON) format and land in the bronze zone in raw form, looking exactly as the data was in the source system itself — errors and all.

      You’ll patch up any known errors, handle missing data, and otherwise cleanse the data. Then you’ll store the cleansed data in the silver zone, still in JSON format.

      

Not all data from your bronze zone will be cleansed and copied into your silver zone. The data lake model calls for loading massive amounts of data into the bronze zone without having to do upfront analysis to determine which data is definitely or likely needed for analysis. When you decide what data you need, you do the necessary data cleansing and move only the cleansed data into the silver zone.
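      To make the bronze-to-silver step concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the sample records, and the cleansing rules (trim names, default a missing amount to zero, drop records that can’t be repaired) don’t come from any particular source system. Your own rules will differ.

```python
import json

# Hypothetical raw records as they might land in the bronze zone, errors and all.
BRONZE_JSON = """
[
  {"order_id": "1001", "customer": " Alice ", "amount": "25.50", "country": "us"},
  {"order_id": "1002", "customer": "Bob", "amount": null, "country": "US"},
  {"order_id": "1003", "customer": "", "amount": "-4.00", "country": "CA"}
]
"""


def cleanse(record):
    """Return a cleansed copy of one record, or None if it can't be repaired."""
    customer = (record.get("customer") or "").strip()
    if not customer:
        return None  # assumed rule: a record without a customer is unusable

    raw_amount = record.get("amount")
    # Assumed rule: a missing amount defaults to 0.0.
    amount = float(raw_amount) if raw_amount is not None else 0.0
    if amount < 0:
        return None  # assumed rule: negative amounts are known errors

    return {
        "order_id": int(record["order_id"]),
        "customer": customer,
        "amount": amount,
        "country": (record.get("country") or "").upper(),
    }


def bronze_to_silver(bronze_json):
    """Cleanse the bronze records and return only the usable ones, still as JSON."""
    bronze_records = json.loads(bronze_json)
    silver_records = [c for c in map(cleanse, bronze_records) if c is not None]
    return json.dumps(silver_records, indent=2)


silver_json = bronze_to_silver(BRONZE_JSON)
print(silver_json)
```

      The raw JSON in the bronze zone is never modified. The sketch reads it, writes a separate cleansed copy for the silver zone, and leaves out whatever doesn’t pass the rules.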

      The gold zone

      The gold zone is the final home for your most valuable analytical data. You’ll curate data coming from the silver zone, meaning that you’ll group and restructure data into “packages” dedicated to your organization’s high-value analytical needs.
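      One way to picture curation is as a set of small summaries built from the same cleansed data. The sketch below is only an illustration: the record layout, the `summarize` helper, and the package names are hypothetical, and a real gold zone would typically hold far richer structures than these totals.

```python
from collections import defaultdict

# Hypothetical cleansed records as they might sit in the silver zone.
silver_orders = [
    {"order_id": 1001, "customer": "Alice", "amount": 25.5, "country": "US"},
    {"order_id": 1002, "customer": "Bob", "amount": 10.0, "country": "US"},
    {"order_id": 1004, "customer": "Alice", "amount": 4.5, "country": "CA"},
]


def summarize(orders, key):
    """Group orders by the given field and total them up."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for order in orders:
        entry = totals[order[key]]
        entry["order_count"] += 1
        entry["total_amount"] += order["amount"]
    return dict(totals)


# Each "package" restructures the same silver data for one analytical need.
gold_packages = {
    "sales_by_country": summarize(silver_orders, "country"),
    "sales_by_customer": summarize(silver_orders, "customer"),
}

for name, package in gold_packages.items():
    print(name, package)
```

      Both packages are derived from the same three orders, so the same underlying data shows up in more than one place in the gold zone.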

      LINKING THE DATA LAKE ZONES TOGETHER

      The following figure shows the progressive pipelines of data among the various zones, including the sandbox. Notice that not every piece or group of data is cleansed and then sent from the bronze zone to the silver zone. You’ll spend time refurbishing, refining, and transmitting to the silver zone only the data that you definitely or likely need for analytics.

      [Figure: The progressive pipelines of data among the various zones, including the sandbox.]

      Likewise, select data sets are sent from the silver zone to the gold zone. Remember that another name for the gold zone is the curated zone, meaning that you’ve especially selected certain data to be consolidated and then placed in “packages” within the gold zone.

      You might transmit raw, uncleansed data from the bronze zone into the sandbox along with data from the silver zone, depending on the specifics of your experimental or short-term analytical needs.

You will almost certainly replicate data across the various gold zone packages, but that’s not a problem at all. As long as you carefully control the data flows and the replicated data, you’re unlikely to run into problems with uncontrolled data proliferation.

      The sandbox

      But what about shorter-term analytical needs or experiments that you want to run with your data? You may be building new machine learning models to predict customer behavior, optimize your supply chain, or determine new treatment plans for a hospital system’s patients. You need to experiment with different machine learning techniques, and you need actual data for your work.

      Head over to the sandbox and start playing. You’ll load whatever data you need for your short-term or experimental work and do your thing. The data lake isolates the sandbox from the data pipeline, so you can do whatever you need without interfering with your organization’s primary analytical work.
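      The isolation idea can be sketched in a few lines of Python. This is a toy model, not how any particular platform implements a sandbox: the zones are plain dictionaries, and `load_sandbox` is a hypothetical helper that hands you independent copies of whatever data sets you ask for.

```python
import copy

# Hypothetical zone contents, modeled as plain dictionaries for illustration.
bronze_zone = {"clickstream": [{"raw": "GET /home"}, {"raw": "GET /cart"}]}
silver_zone = {
    "orders": [
        {"order_id": 1001, "amount": 25.5},
        {"order_id": 1002, "amount": 10.0},
    ]
}


def load_sandbox(*sources):
    """Copy the requested data sets into a new sandbox.

    Each source is a (zone, dataset_name) pair. Deep copies keep the
    sandbox independent of the zones the data came from.
    """
    return {name: copy.deepcopy(zone[name]) for zone, name in sources}


sandbox = load_sandbox((silver_zone, "orders"), (bronze_zone, "clickstream"))

# Experiment freely: add a derived feature for a model you're trying out.
for order in sandbox["orders"]:
    order["is_large"] = order["amount"] > 20

print(sandbox["orders"])
print(silver_zone["orders"])  # unchanged by the experiment
```

      Because the sandbox works on copies, nothing you try there leaks back into the silver zone or the rest of the pipeline.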

      Turn the clock back to the early 2010s, when big data burst onto the scene. Almost every organization was exploring how this new generation of data management technology could overcome many of the barriers and constraints of relational databases, particularly for analytical storage.

      Big data promised — and delivered — significantly greater capacity than was possible with relational databases. With big data, you can store unstructured and semi-structured data alongside your structured data. You can also bring new data into a big data environment with lower latency than with relational databases.

      Wait a minute! That sounds just like the description of a data lake! So, is a data lake just another name for big data?

      Well, sort of … possibly … or maybe not …

      The best way to think of the two disciplines in relation to one another is as follows:

       Big data is the underlying core technology used to build a data lake.

       A data lake is an environment that includes big data but also potentially other data management technologies along with services for data transmission and data governance.

      THE THREE (OR FOUR OR FIVE OR MORE) VS OF BIG DATA AND DATA LAKES

      Quick quiz: Name all the Vs of big data and data lakes. You can start with the original three: volume, variety, and velocity. But you’ll also find blog posts and online articles that mention value, veracity (a formal term for accuracy), visualization, and many others. In fact, don’t be surprised if one day you read an article or blog post that also includes Valentine’s Day!

      The original three Vs of big data came from an analyst named Doug Laney, way back in 2001, when he was at META Group (a firm that Gartner later acquired). Volume, variety, and velocity were primarily aspirational characteristics of data environments, describing next-generation characteristics beyond what the relational databases of the time were capable of supporting.

      Over the years, other industry analysts, bloggers, consultants, and product vendors added to the list with their own Vs. The difference between the original three Vs and those that followed, though, is that value, veracity, visualization, and others all apply to tried-and-true relational technology just as much as to big data.

      Don’t get confused trying to decide how many Vs apply to big data and to data lakes. Just focus on the original three — volume, variety, and velocity — as the must-have characteristics of your data lake.

You’ll find varying perspectives on the relationship between big data and data lakes, which certainly confuses the issue. Some technologists reverse the relationship between big data and data lakes; they consider a data lake to be the core technology and big data to be the overall environment. So, if you run across a blog post or another description that differs from the one I use, don’t worry. As with almost everything about data lakes and much of the technology world, you’ll find all sorts of opinions and perspectives, especially when you don’t have any official standards to govern a discipline.

      The Hadoop open-source environment, particularly the Hadoop Distributed File System (HDFS), is one of the first and most popular examples of big data. Some of the earliest data lakes were built, or at least begun, using HDFS as the foundation.

For purposes of establishing a data lake foundation, Amazon’s Simple Storage Service (S3) and Microsoft’s Azure Data Lake Storage (ADLS) both qualify as big data. Why? Both S3 and ADLS support the three Vs of big data, which are as follows:

       Storing extremely large volumes of data

       Supporting a variety of data, including structured, unstructured, and semi-structured data

       Allowing data to flow into the data lake at very high velocity, rather than requiring, or at least encouraging, periodic batch loads
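      The variety and velocity points are easier to see with a small model of object storage. The `ObjectStore` class below is a toy in-memory stand-in written for this illustration. It is not the S3 or ADLS API, and the keys and sample data are made up. The idea it shows is that an object store treats every object as opaque bytes under a key, so structured, semi-structured, and unstructured data can sit side by side, and new objects can be written one at a time as they arrive.

```python
import csv
import io
import json


class ObjectStore:
    """Toy in-memory stand-in for object storage (not the S3 or ADLS API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        """Store raw bytes under a key; the store never inspects the content."""
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def keys(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()

# Structured data: a small CSV extract.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["order_id", "amount"])
writer.writerow([1001, 25.5])
store.put("bronze/orders/extract.csv", buffer.getvalue().encode("utf-8"))

# Semi-structured data: a JSON event.
first_event = {"event": "click", "page": "/home"}
store.put("bronze/events/0001.json", json.dumps(first_event).encode("utf-8"))

# Unstructured data: arbitrary binary content (here, four sample bytes).
store.put("bronze/images/logo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))

# Velocity: events are written individually as they arrive,
# rather than being held back for a periodic batch.
incoming = [{"event": "click"}, {"event": "purchase"}]
for number, event in enumerate(incoming, start=2):
    store.put(f"bronze/events/{number:04d}.json", json.dumps(event).encode("utf-8"))

print(store.keys("bronze/"))
```

      Volume isn’t something a toy example can demonstrate, but the same key-and-bytes model is what lets real object stores scale out without caring what the objects contain.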

      

Think of big data as a core technology foundation that supports the three Vs of next-generation data management. Big data by itself, however, is