Dan Sullivan

Official Google Cloud Certified Professional Data Engineer Study Guide



SELECT … FROM … GROUP BY
SELECT state, COUNT(*) FROM address GROUP BY state
Returns the number of addresses in each state

SELECT … FROM … GROUP BY … HAVING
SELECT state, COUNT(*) FROM address GROUP BY state HAVING COUNT(*) > 50
Returns the number of addresses in each state that has more than 50 addresses
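      The two grouping patterns can be tried against an in-memory SQLite database. This is a minimal sketch: the `address` table layout (a `street` and a `state` column), the sample rows, and the helper name `count_addresses_by_state` are assumptions made for illustration, and the threshold is passed as a parameter instead of the fixed value 50.

```python
import sqlite3


def count_addresses_by_state(rows, more_than=None):
    """Count addresses per state.

    If more_than is given, keep only states whose count is strictly
    greater than that value (the HAVING COUNT(*) > n pattern).
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout; the text only names the table and the state column.
    conn.execute("CREATE TABLE address (street TEXT, state TEXT)")
    conn.executemany("INSERT INTO address VALUES (?, ?)", rows)

    query = "SELECT state, COUNT(*) FROM address GROUP BY state"
    params = ()
    if more_than is not None:
        # HAVING filters groups after aggregation; WHERE filters rows before it.
        query += " HAVING COUNT(*) > ?"
        params = (more_than,)
    query += " ORDER BY state"

    result = conn.execute(query, params).fetchall()
    conn.close()
    return result


sample_rows = [("1 Main St", "OR"), ("2 Oak Ave", "OR"), ("3 Pine Rd", "WA")]
all_counts = count_addresses_by_state(sample_rows)
busy_states = count_addresses_by_state(sample_rows, more_than=1)
```

      Here `all_counts` holds one row per state, while `busy_states` keeps only the states with more than one address.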

      NoSQL Database Design

       Key-value

       Document

       Wide column

       Graph

      Each type of NoSQL database is suited for different use cases depending on data ingestion, entity relationships, and query requirements.

      Key-Value Data Stores

Key Value
Instance1 PartitionA
Instance2 PartitionB
Instance3 PartitionA
Instance4 PartitionC

      Key-value data stores are simple, but it is possible to have more complex data structures as values. For example, a JSON object could be stored as a value. This would be a reasonable use of a key-value data store if the JSON object were only looked up by the key and there were no need to search on items within the JSON structure. In situations where items in the JSON structure should be searchable, a document database would be a better option.
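      The trade-off can be illustrated with a toy in-memory store. This is a sketch only: the `KeyValueStore` class, its method names, and the sample keys are hypothetical and do not correspond to any particular product's API. Lookup by key is a single dictionary access, while searching on a field inside the JSON value forces a scan that parses every stored value.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque strings to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Direct lookup by key, the access pattern key-value stores are built for.
        return self._data.get(key)

    def scan(self, predicate):
        # Searching inside values means decoding every value in the store.
        return [k for k, v in self._data.items() if predicate(json.loads(v))]


store = KeyValueStore()
store.put("user:42", json.dumps({"name": "Ada", "plan": "pro"}))
store.put("user:43", json.dumps({"name": "Lin", "plan": "free"}))

profile = json.loads(store.get("user:42"))
pro_users = store.scan(lambda doc: doc["plan"] == "pro")
```

      The full scan in `scan` is the cost that a document database avoids by indexing fields inside the stored structure.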

      Cloud Memorystore is a fully managed key-value data store based on Redis, a popular open source key-value datastore. As of this writing, Cloud Memorystore does not support persistence, so it should not be used for applications that need to save data to persistent storage. Open source Redis does support persistence. If you want to use Redis as a key-value store with persistent storage, you can run and manage your own Redis service in Compute Engine or Kubernetes Engine.

      Document Databases

      Consider an online game that requires a database to store information about players’ game state. The player state includes

       Player name

       Health score

       List of possessions

       List of past session start and end times

       Player ID

      The player name, health score, and list of possessions are often read together and displayed for players. The list of sessions is used only by analysts reviewing how players use the game. Since there are two different use cases for reading the data, there should be two different documents. In this case, the first three attributes should be in one document along with the player ID, and the sessions should be in another document with player ID.
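      The split described above might look like the following. This is a sketch with plain Python dictionaries standing in for two document collections; the field names, the sample player, and the helper functions are assumptions chosen for illustration.

```python
# Collection read by the game client: the attributes that are displayed together.
profiles = {
    "p-1001": {
        "player_id": "p-1001",
        "name": "Robin",
        "health_score": 87,
        "possessions": ["lantern", "rope"],
    }
}

# Collection read only by analysts: session history, linked by player ID.
sessions = {
    "p-1001": {
        "player_id": "p-1001",
        "sessions": [
            {"start": "2020-01-05T18:00:00Z", "end": "2020-01-05T19:30:00Z"},
            {"start": "2020-01-06T20:15:00Z", "end": "2020-01-06T21:00:00Z"},
        ],
    }
}


def load_profile(player_id):
    """Fetch what the game displays without touching session history."""
    return profiles[player_id]


def load_sessions(player_id):
    """Fetch the session list for analysis, independent of the profile."""
    return sessions[player_id]["sessions"]


profile = load_profile("p-1001")
history = load_sessions("p-1001")
```

      Each read path retrieves one document, so the frequently read profile stays small even as the session list grows.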

      When you need a managed document database in GCP, use Cloud Datastore. Alternatively, if you wish to run your own document database, MongoDB, CouchDB, and OrientDB are options.

      Wide-Column Databases

      Wide-column databases are suited to use cases with the following characteristics:

       High volumes of data

       Need for low-latency writes

       More write operations than read operations

       Limited range of queries—in other words, no ad hoc queries

       Lookup by a single key

      Wide-column databases have a data model similar to the tabular structure of relational tables, but there are significant differences. Wide-column databases are often sparse, with the exception of IoT and other time-series databases that have few columns that are almost always used.
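      Sparseness can be pictured with a toy model in which each row stores only the columns that were actually written. This is an illustrative sketch, not the Bigtable or Cassandra API; the `SparseTable` class, the row keys, and the column names are all hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy wide-column table: each row keeps only the columns it has values for."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def write(self, row_key, column, value):
        self._rows[row_key][column] = value

    def read_row(self, row_key):
        # Lookup by a single row key; absent columns are simply not present.
        return dict(self._rows.get(row_key, {}))

    def columns(self):
        # Union of all column names used by any row.
        return sorted({c for row in self._rows.values() for c in row})


table = SparseTable()
table.write("user#1", "email", "a@example.com")
table.write("user#1", "phone", "555-0100")
table.write("user#2", "twitter", "@lin")

row_one = table.read_row("user#1")
row_two = table.read_row("user#2")
all_columns = table.columns()
```

      The table as a whole has three columns, yet no row stores more than two, and nothing is stored for the missing cells. A relational table with the same columns would hold a NULL in each of those positions.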

      Bigtable is GCP’s managed wide-column database. It is also a good option for migrating on-premises Hadoop HBase databases to a managed database because Bigtable has an HBase interface. If you wish to manage your own wide-column database, Cassandra is an open source option that you can run in Compute Engine or Kubernetes Engine.

      Graph Databases

      Data is retrieved from a graph using one of two types of queries. One type of query uses SQL-like declarative statements that describe patterns to look for in a graph, as in the Cypher query language. The following Cypher query returns a list of persons and the friends of each person’s friends:

      MATCH (n:Person)-[:FRIEND]-(f)
      MATCH (n)-[:FRIEND]-()-[:FRIEND]-(fof)
      RETURN n, fof

      The other option is to use a traversal language, such as Gremlin, which specifies how to move from node to node in the graph.
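      The traversal style can be sketched in plain Python by walking an adjacency map step by step. This is not Gremlin or any graph database API; the sample graph and the function name `friends_of_friends` are assumptions for illustration. The function moves from a person to each friend, then on to each of that friend's friends, skipping the step back to the starting person.

```python
# Undirected FRIEND relationships stored as an adjacency map (hypothetical data).
friends = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol"},
}


def friends_of_friends(graph, person):
    """Return everyone reachable in exactly two FRIEND hops from person."""
    result = set()
    for friend in graph.get(person, ()):      # first hop
        for fof in graph.get(friend, ()):     # second hop
            if fof != person:                 # do not walk straight back
                result.add(fof)
    return result


alice_fof = friends_of_friends(friends, "alice")
bob_fof = friends_of_friends(friends, "bob")
```

      A declarative query states the pattern and leaves the walk to the database, whereas a traversal spells out each hop as this function does.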

      GCP does not have a managed graph database, but Bigtable can be used as the storage backend for HGraphDB (https://github.com/rayokota/hgraphdb) or JanusGraph (https://janusgraph.org).

      Exam Essentials

      Know the four stages of the data lifecycle: ingest, storage, process and analyze, and explore and visualize. Ingestion is the process of bringing application data, streaming data, and batch data