at dbt Labs, the makers of dbt (data build tool). She helps data professionals learn and apply modern analytics‐engineering practices, and is an organizer for Coalesce, the dbt Community’s annual conference. David is a Data Science Consultant and was the Global Lead Data Science Instructor at General Assembly. He helps people around the world better leverage their data. Emilie, Mila, and David have shaped the narrative and content of this book. Their (sometimes) line‐by‐line feedback has ensured that we can proudly stand behind our recommendations.
Influences
We've drawn on several sources of information and opinion when writing this text. While at Chartio, we worked with hundreds of modern cloud‐based customers. We've collected, implemented, and refined these practices ourselves and, through writing this book, vetted them further with partners and customers. We've also learned from the data community through dataschool.com, blogs like Tristan Handy's, and data‐focused Slack communities.
And lastly, it's worth noting and thanking some classic books that informed the previous generation of warehousing toolkits. We honor them by echoing their terminology and best practices wherever possible:
Agile Data Warehouse Design by Lawrence Corr
The Data Warehouse Toolkit by Ralph Kimball
Information Dashboard Design by Stephen Few
How This Book Was Written
This book originates in part from a project within The Data School (Figure A.2), a collection of free online books and interactive tutorials on managing and leveraging data (see dataschool.com). These resources are always expanding, much like the articles of Wikipedia: each round of updates sees our ebooks cover additional topics, go deeper on established ideas, share more real‐world examples, and better deliver that content. Our goal is to maintain and improve these resources and keep them modern.
Source: The Data School
Few people are complete “experts” in all of the areas of modern data governance, and the landscape is changing all the time. If you have a story to share, a chapter you think is missing, or a new idea, email us. Even if you aren't sure what specifically to share, please reach out if you're willing to tell your story: we are particularly interested in adding real‐world experiences and insights.
There is already too much jargon in the data world, often created by talented vendor marketing teams. We try to stick with the most common and straightforward words that are already in use. For any jargon we do find necessary, we include a definition.
There are many books about the old ways of working with data. We're highlighting current best practices here, so we ignore outdated terminology and techniques. In the few cases where it is useful to talk about industry evolution, such as the change from ETL to ELT, we teach the current practice (ELT) and discuss the choice in a separate chapter.
Almost every part of this book could be contentious to someone, in some use case or to some vendor. In writing this book, it was tempting to bring up the caveats everywhere and write what would ultimately be a very defensive and overly explained book. We believe that type of book is far less useful for people seeking straightforward advice. Where we have a strong opinion, we don't argue it; we just go with it. Where we think the reader has a legitimate choice to make, we lay out the options.
This book aims to provide a broad overview and general guidelines on how to set up a data stack. We intentionally gloss over the details of launching a Redshift instance, writing SQL, or using various BI products. Covering them would clutter the text, repeat what's already on the internet, and quickly become stale.
How to Read This Book
The book starts with a quick overview and decision charts that explain what the stages are and which stage is appropriate for you. The rest of the book has a section for each of the four stages, and if you'd like, you can jump ahead to the stage you're at.
Not every company needs the entirety of this book. As a growing company's data needs expand, more and more of the book becomes valuable. Note, though, that each best practice is introduced at the stage where it first becomes relevant. To avoid redundancy, we assume it remains useful from that point onward and do not repeat it in later stages. So it may benefit you to at least skim the earlier stages even if you and your company are further ahead.
At the end of the book we have a section where we describe what has changed in the data world that makes this new architecture relevant and performant. In the main chapters, we avoid explaining how our recommendations differ from previous practices like Kimball dimensional modeling so as not to clutter the text. Such discussions are necessary, however, and we've put them in this last section of the book.
Lastly, throughout the book you will see the following icons:
Definitions explain a term found on the same page. For example, on this page, the term “data lake” is mentioned. A data lake is a staging area for several data sources.
Protips expand on an idea or provide additional information about a topic related to what you read within a given chapter.
Foreword
In 2015, I used a product called Amazon Redshift. At the time, I had spent the prior 15 years of my career in a variety of roles all centered around their use of data, from analytics to marketing to operations. And while I considered my data competency my biggest professional differentiator, I had also become deeply frustrated. For all of the supposed progress in the data ecosystem, it was still slow, hard, and expensive to get insights out of data.
But my first experience with Redshift is where that all changed for me. I have such a visceral memory of the first hour I spent with the product: queries I ran returned so fast that it seemed like absolute magic. I had spent years and years of my career writing queries and waiting for the macOS “spinner” icon to stop spinning. Now, all of a sudden, these same queries weren’t 20% faster…they were 10 to 100 to 1000x faster. I felt like I had superpowers.
I'll let Dave and Matt actually explain how the modern data warehouse can achieve these types of performance results, but for now, just trust me that it can and does. Given that, the fascinating question is actually: what does this mean for people like you and me?
What kind of “people” do I mean? You know—people who are involved in making decisions