Dave Fowler

The Informed Company


Скачать книгу

rel="nofollow" href="#ulink_0a370394-5948-5dc7-83dd-296f20643e70">Figure I.2). Each of these stages has best practices and guidance on how to progress further “up the stack.”

image

      This book maps onto these stages:

      1 Source

      2 Lake

      3 Warehouse

      4 Mart

      Each vertical stage pictured could be what a company needs, depending on the resources, size, importance, and needs of data within an organization. Not every company needs all four stages. If a business gets the insights they need while in the source stage, then they should stay in the source stage. Likewise, a company may find it needs a lake to combine all of its data together but may not need the transformations implied in later stages of warehouse development.

      Let's begin with an overview of each step.

      Stage 1. Sources

      Many teams only have a few data sources and commonly manipulate and visualize data in Excel or Google sheets. Two common sources that company products run on are Google Analytics and “transactional” database(s), such as PostgreSQL or MySQL. If only a few people at a company need to work with data sources, it's possible to set them up with direct or replica access; it's more agile and straightforward for them to work with the data directly.

      This stage is right if:

       Only a few people are going to be working with this dataset.

       Data needs are minimal at the moment.

       Only a few small data sources exist.

       The only people who need to make new visuals are fairly technical.

      It's time for the next stage if:

       Data is accessed from multiple places/applications.

       There are needs for unique or combined charts/dashboards for cloud application sources like Salesforce.

       A growing number of people need access to data.

       There are performance issues.

       Data is getting too big for a transactional database to operate efficiently.

       Nontechnical users need to create charts without help.

      Stage 2. Lake

      Once companies must run analyses on multiple sources of data, each of which need joining, filtering, and manipulation, a company must move to a data lake. Blended data sources enable several actors in an organization to query a large subset of the company's complete data. In turn, funneling various sources into a data lake supports database performance at a reasonably large (not necessarily “big data”) scale.

      The central challenge faced by companies in the lake stage is knowing what toolset and methodology will unify and (safely) store your data. Companies looking to combine their data also run into performance issues, which we offer solutions to. And perhaps most important of all, choosing an architecture during lake development informs how easy (or hard) it will be for a company to build their future data warehouse.

      This stage is right if:

       There's a need for unique or combined charts/dashboards for cloud application sources like Salesforce.

       A core set of people can learn the ins and outs of the structure of the messy data.

       You're intimidated by data modeling. (Don't be: that's why this book exists.)

       There's no time for even light data modeling.

       Large datasets need performant queries.

      It's time for the next stage if:

       More than a few people are going to be working with this dataset.

       A clean source of truth would eliminate integrity issues.

       There's a need to separate the structure of the data from the always‐changing transactional sources.

       There's a need to adopt DRY (Don't Repeat Yourself) principles.

       image Modeling

      Data requires transformation so that it is more usable by different people or systems within a database. Modeling refers to this process of making these transformations to the data.

       image DRY

      An acronym that represents a software design ethos that avoids repetitive patterns through modularity of design.

      Stage 3. Warehouse (Single Source of Truth)

      As more people begin to work with the data lake, questions begin to multiply: What data is where? Why? What particular criteria should queries use when looking for data insights? What do these schemata mean? Unavoidable complexities make it harder to obtain data, especially by less‐technical colleagues. Even among in‐house experts, more schemata and entities (i.e. tables and views) in turn cause more communication headaches. In time, the data lake serves all data but makes it harder to obtain the right data. It gets harder to write queries and share the knowledge within an organization.

      All of these problems can be addressed with a clean and simplified version of the data, something we refer to as “a single source of truth.”

      In the data warehouse section of the book, we review how to clean data lakes and investigate