Dave Fowler

The Informed Company



at dbt Labs, the makers of dbt (data build tool). She helps data professionals learn and apply modern analytics‐engineering practices, and is an organizer for Coalesce, the dbt Community’s annual conference. David is a Data Science Consultant and was the Global Lead Data Science Instructor at General Assembly. He helps people around the world better leverage their data. Emilie, Mila, and David have shaped the narrative and content of this book. Their (sometimes) line‐by‐line feedback has ensured that we can proudly stand behind our recommendations.

      And lastly, it's worth noting and thanking some classic books that informed the previous generation of warehousing toolkits. We honor them by echoing their terminology and best practices wherever possible:

       Agile Data Warehouse Design by Lawrence Corr

       The Data Warehouse Toolkit by Ralph Kimball

       Information Dashboard Design by Stephen Few (my review here)

      [Image. Source: The Data School]

      Few people are complete “experts” in every area of modern data governance, and the landscape changes all the time. If you have a story to share, a chapter you think is missing, or a new idea, email us. Even if you aren't sure what specifically to share but don't mind sharing your story, please reach out: we are particularly interested in adding real‐world experiences and insights.

      There is already too much jargon in the data world, often created by talented vendor marketing teams. We try to stick with the most common and straightforward words that are already in use. For any jargon we do find necessary, we include a definition.

      There are many books for the old ways of working with data. We're highlighting current best practices here, so we ignore outdated terminology and techniques. In a few cases where it is beneficial to talk about industry evolution—like the change from ETL to ELT—we teach ELT and discuss the choice in a separate chapter.

      Almost every part of this book could be contentious to someone, in some use case or to some vendor. In writing this book, it is tempting to bring up the caveats everywhere and write what would ultimately be a very defensive and overly explained book. We believe this type of book is way less useful for people seeking straightforward advice. Where we have a strong opinion, we don't argue it; we just go with it. Where we think the user has a legitimate choice to make, we pose those options.

      This book aims to provide a broad overview and general guidelines on how to set up a data stack. We intentionally gloss over the details of launching a Redshift instance, writing SQL, or using various BI products. That would clutter the text, repeat what's already on the internet, and make the read quite stale.

      Not every company needs the entirety of this book. As a growing company's data needs expand, more and more of the book becomes valuable. Note, though, that each best practice is introduced at the stage where it first becomes relevant. To avoid redundancy, we assume it stays useful from that point in the book onward. So it may benefit you to at least skim the earlier stages even if you and your company are further ahead.

      At the end of the book, we have a section where we describe what has changed in the data world that makes this new architecture relevant and performant. We avoid explaining how our recommendations differ from previous practices like Kimball dimensional modeling so as not to clutter the experience. Such discussions are necessary, however, and we've put them in this last section of the book.

      Lastly, throughout the book you will see the following icons:

       Definitions

      Definitions explain a term found on the same page. For example, on this page, the term “data lake” is mentioned. A data lake is a staging area for several data sources.

       Protips

      Protips expand on an idea or provide additional information about a topic related to what you read within a given chapter.

      In 2015, I used a product called Amazon Redshift. At the time, I had spent the prior 15 years of my career in a variety of roles all centered on the use of data, from analytics to marketing to operations. And while I considered my data competency my biggest professional differentiator, I had also become deeply frustrated. For all of the supposed progress in the data ecosystem, it was still slow, hard, and expensive to get insights out of data.

      But my first experience with Redshift is where that all changed for me. I have such a visceral memory of the first hour I spent with the product: queries I ran returned so fast that it seemed like absolute magic. I had spent years and years of my career writing queries and waiting for the macOS “spinner” icon to stop spinning. Now, all of a sudden, these same queries weren’t 20% faster…they were 10 to 100 to 1000x faster. I felt like I had superpowers.

      I'll let Dave and Matt actually explain how the modern data warehouse can achieve these types of performance results, but for now, just trust me that it can and does. Given that, the fascinating question is actually: what does this mean for people like you and me?