Alan R. Simon

Data Lakes For Dummies


Скачать книгу

When you divide your big data into multiple zones, add capabilities to transmit data across those zones, and then govern the whole environment, you’ve built a data lake surrounding that big data foundation. You’ve done the analytical data equivalent of building the docks, the restaurants, and the boat slips surrounding the lake itself.

      In addition to data lakes, you may come across references to data ponds, data puddles, data rivers, data oceans, and data hot tubs. (Just kidding about the last one.) What’s going on here?

      

Your job when planning, architecting, building, and using a data lake is complicated by the fact that you don’t have an official definition published by some sort of standards body, such as the American National Standards Institute (ANSI) or the International Organization for Standardization (ISO). That means that you or anyone else can define, use, and even publish your own terminology. You can call a smaller portion of a data lake a “data pond” if you want, or refer to a collection of data lakes as a “data ocean.”

      Don’t panic! Of all the “data plus a body of water” terms you’ll run across, data lake is by far the most commonly used. All the characteristics of a data lake — solid architecture, support for multiple forms of data, a support ecosystem surrounding the data — apply to what you can call a data pond or any other term.

      If William Shakespeare were still around and plied his trade as an enterprise data architect rather than as a writer, he would put it this way: “A data lake by any other name would still be worth the time and effort to build.”

      BACK TO THE FUTURE WITH NAME CHANGES

      In the early 1990s, data warehousing was the newest and most popular game in town for analytical data management. By the mid-’90s, the concept of a data warehouse was adapted to a data mart — essentially, a smaller-scale data warehouse. The original idea behind a data mart called for the data warehouse feeding a subset of its data into one or more data marts — sort of a “wholesaler-retailer” model.

      The first generation of data warehouse projects, especially very large ones, was hallmarked by a high failure rate. By the late ’90s, data warehouses were viewed as large, complex, and expensive efforts that were also very risky. A data mart, on the other hand, was smaller, less complex, and less expensive, and, thus, considered to be less risky.

      The need for integrated analytical data was stronger than ever by the end of the ’90s. But just try to get funding for a data warehousing project! Good luck!

      Time for plan B.

      Data warehouses went out of style for a while. Instead, data marts became the go-to solution for analytic data. No matter how big and complex an environment was, chances are, you’d refer to it as a data mart rather than a data warehouse. In fact, the idea of an independent data mart sprung up, and the original architecture for a data mart — receiving data from a data warehouse rather than directly from source systems — became known as a dependent data mart.

      Fast-forward a couple of decades, and it’s back to the future. First, big data sort of evolved into data lakes. Now you have analysts, consultants, and vendors complicating the picture with their own terminology. This won’t be the last time you’ll see shifting names and terminology in the world of analytic data, so stay tuned!

      Planning Your Day (and the Next Decade) at the Data Lake

      IN THIS CHAPTER

      

Taking advantage of big data

      

Broadening your data type horizons

      

Implementing a built-to-last analytical data environment

      

Reeling in existing stand-alone data marts

      

Blockading new stand-alone data marts

      

Deciding what to do about your data warehouses

      

Aligning your data lake plans with your organization’s analytical needs

      

Setting your data velocity speed limits

      

Getting a handle on your analytical costs

      Suppose that you and about 15 other family members or friends all head to your favorite lake for a weeklong summer vacation.

      A data lake is very much like that weeklong trip to your favorite lake. Because a data lake is an enterprise-scale effort, spanning numerous organizations and departments, as well as many different business functions, you and your coworkers will likely seek a variety of varying benefits and outcomes from all that hard work.

      The best data lakes are those that satisfy the needs of a broad range of constituencies — basically, something for everyone to make the results well worth the effort.

      Maybe your organization has been dabbling in the world of big data for a while, going back to when Hadoop was one of the hottest new technologies. You’ve built some pretty nifty predictive analytics models, and now you’re fairly adept at discovering important patterns buried in mountains of data.

      So far, though, your AAA — adventures in advanced analytics — have been highly fragmented. In fact, your analytical data is all over the place. You don’t have consistent approaches to cleansing and refining raw data to get the data ready for analytics; different groups do their own thing. It’s like the Wild West out there!

      The concept of a data lake helps you harness the power of big data technology to the benefit of your entire organization. By following emerging best practices, avoiding traps and pitfalls, and building a solidly architected data lake, you can seize the day and help take your organization to new heights when it comes to analytics and data-driven insights.

      

You’ll achieve economies of scale for the data side of analytics throughout your organization, which means that you’ll get “more bang for your buck” when it comes to acquiring, consolidating, preparing,