Alan R. Simon

Data Lakes For Dummies



Remember: Building a data lake is more than just loading massive amounts of data into some storage location.

      To support this near-constant expansion and growth, you need to ensure that your data lake is well architected and solidly engineered, which means that the data lake

       Enforces standards and best practices for data ingestion, data storage, data transmission and interchange among its components, and data delivery to end users

       Minimizes workarounds and temporary interfaces that have a tendency to stick around longer than planned and weaken your overall environment

       Continues to meet your predetermined metrics and thresholds for overall technical performance, such as data loading and interchange, as well as user response time

      Think about a resort that builds docks, a couple of lakeside restaurants, and other structures at various locations alongside a large lake. You wouldn’t just hand out lumber, hammers, and nails to a bunch of visitors and tell them to start building without detailed blueprints and engineering diagrams. The same is true with a data lake. From the first piece of data that arrives, you need as solid a foundation as possible to help keep your data lake viable for a long time.

      A really great lake

      You’ll come across definitions and descriptions that tell you a data lake is a centralized store of data, but that definition is only partially correct.

The data services that you use for your data lake, such as the Amazon Simple Storage Service (S3), Microsoft Azure Data Lake Storage (ADLS), or the Hadoop Distributed File System (HDFS), manage the distribution of data among potentially numerous servers where your data is actually stored. These services hide the physical distribution from almost everyone other than those who need to manage the data at the server storage level. Instead, they present the data as being logically part of a single data lake. Figure 1-1 illustrates how logical centralization accompanies physical decentralization.


      FIGURE 1-1: A logically centralized data lake with underlying physical decentralization.
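The idea of one logical lake sitting on top of many physical servers can be sketched in a few lines of Python. This is a toy model only: the class name, the node names, and the hash-based placement are assumptions made for illustration, and they don't describe how S3, ADLS, or HDFS actually place data.

```python
import hashlib


class LogicalDataLake:
    """Toy model: one logical namespace backed by several storage nodes."""

    def __init__(self, node_names):
        # Each hypothetical node is just an in-memory dict in this sketch.
        self._names = sorted(node_names)
        self._nodes = {name: {} for name in self._names}

    def _node_for(self, key):
        # Deterministically pick a node from a hash of the object key.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def put(self, key, data):
        self._nodes[self._node_for(key)][key] = data

    def get(self, key):
        return self._nodes[self._node_for(key)][key]

    def list_keys(self):
        # Callers see one flat, sorted listing, regardless of placement.
        return sorted(k for node in self._nodes.values() for k in node)


lake = LogicalDataLake(["node-a", "node-b", "node-c"])
lake.put("sales/2023/orders.csv", b"order_id,amount\n1,9.99\n")
lake.put("hr/employees.json", b'{"id": 1}')
print(lake.list_keys())
```

Whoever calls `put`, `get`, or `list_keys` never names a node. Only the private `_node_for` method knows where an object physically lives, which mirrors the way the storage service hides physical distribution from almost everyone.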

      Expanding the data lake

      How big can your data lake get? To quote the old saying (and to answer a question with a question), how many angels can dance on the head of a pin?


      FIGURE 1-2: Cloud-based data lake solutions.

Cloud providers give you pricing for data storage and access that increases as your needs grow or decreases if you cut back on your functionality. Basically, your data lake will be priced on a pay-as-you-go basis.
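As a rough illustration of pay-as-you-go pricing, the following Python sketch computes a monthly bill from the amount of data stored and the number of access requests. The function name and both rates are hypothetical placeholders, not any provider's actual prices; real price lists have more dimensions, such as storage tiers and data transfer.

```python
def monthly_cost(stored_gb, requests,
                 price_per_gb=0.02, price_per_1k_requests=0.005):
    """Estimate a monthly bill under a simple pay-as-you-go model.

    Both rates are made-up defaults for illustration only.
    """
    storage_charge = stored_gb * price_per_gb
    access_charge = (requests / 1000) * price_per_1k_requests
    return round(storage_charge + access_charge, 2)


# The bill rises as the lake grows and falls if you cut back.
print(monthly_cost(stored_gb=1000, requests=2_000_000))
print(monthly_cost(stored_gb=500, requests=2_000_000))
```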

      Some of the very first data lakes that were built in the Hadoop environment may reside in your corporate data center and be categorized as on-prem (short for on-premises, meaning “on your premises”) solutions. But most of today’s data lakes are built in the Amazon Web Services (AWS) or Microsoft Azure cloud environments. Given the ever-increasing popularity of cloud computing, it’s highly unlikely that this trend of cloud-based data lakes will reverse for a long time, if ever.

      As long as Amazon, Microsoft, and other cloud platform providers can keep expanding their existing data centers and building new ones, as well as enhancing the capabilities of their data management services, your data lake should be able to avoid scalability issues.

      

A multiple-component data lake architecture (see Chapter 4) further helps overcome performance and capacity constraints as your data lake grows in size and complexity, providing even greater scalability.

      More than just the water

      

A data lake is an entire environment, not just a gigantic collection of data that is stored within a data service such as Amazon S3 or Microsoft ADLS.

      In addition to data storage, a data lake also includes the following:

       One or (usually) more mechanisms to move data from one part of the data lake to another.

       A catalog or directory that helps keep track of what data is where, as well as the associated rules that apply to different groups of data; this is known as metadata.

       Capabilities that help unify meanings and business rules for key data subjects that may come into the data lake from different applications and systems; this is known as master data management.

       Monitoring services to track data quality and accuracy and response time when users access data, billing services to charge different organizations for their usage of the data lake, and plenty more.
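To make the catalog idea from the list above a little more concrete, here is a minimal Python sketch of a metadata catalog that records where each dataset lives and which rules apply to it. The class names, field names, and the example storage locations are all invented for illustration; real catalog services are far richer.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one dataset (all field names are hypothetical)."""
    name: str
    location: str
    owner: str
    rules: list = field(default_factory=list)


class Catalog:
    """Keeps track of what data is where and which rules apply to it."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def locate(self, name):
        return self._entries[name].location

    def datasets_with_rule(self, rule):
        # Return the names of all datasets that carry the given rule.
        return sorted(e.name for e in self._entries.values() if rule in e.rules)


catalog = Catalog()
catalog.register(CatalogEntry("customers", "raw-zone/crm/customers/",
                              "sales-ops", ["contains-pii"]))
catalog.register(CatalogEntry("web_clicks", "raw-zone/web/clicks/",
                              "marketing"))
print(catalog.locate("customers"))
print(catalog.datasets_with_rule("contains-pii"))
```

Keeping the rules next to the location means a single lookup can answer both "where is this data?" and "what am I allowed to do with it?"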

      Different types of data

      If your data lake had a motto, it might be “All data are created equal.”

      In a data lake, data is data is data. In other words, you don’t need to make any more special accommodations for complex types of data than you would for simpler forms of data.

      Structured data: Staying in your own lane

      You’re probably most familiar with structured data, which is made up of numbers, shorter-length character strings, and dates. Traditionally, most