Dan Sullivan

Official Google Cloud Certified Professional Data Engineer Study Guide



as documents or as wide columns. An important distinction between the two is how data is retrieved from them.

      Fully Indexed, Semi-Structured Data

      {
        {'id': '123456', 'product_type': 'dishwasher', 'length': '24 in', 'width': '34 in', 'weight': '175 lbs', 'power': '1800 watts'}
        {'id': '987654', 'product_type': 'chair', 'weight': '15 kg', 'style': 'modern', 'color': 'brown'}
      }

      To search efficiently by attributes, document databases allow for indexes. If you use Cloud Datastore, for example, you could create indexes on each of the attributes as well as a combination of attributes. Indexes should be designed to support the way that data is queried. If you expect users to search for chairs by specifying style and color together, then you should create a style and color index. If you expect customers to search for appliances by their power consumption, then you should create an index on power.
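
      The idea of matching indexes to query patterns can be sketched in a few lines of Python. This is a toy in-memory illustration, not how Cloud Datastore implements or declares indexes; the helper name build_index and the variable names are hypothetical. It builds one composite index on style and color and one single-attribute index on power, using the two documents from the example above.

```python
from collections import defaultdict

# Sample documents from the example above; not every document has every attribute.
documents = [
    {"id": "123456", "product_type": "dishwasher", "length": "24 in",
     "width": "34 in", "weight": "175 lbs", "power": "1800 watts"},
    {"id": "987654", "product_type": "chair", "weight": "15 kg",
     "style": "modern", "color": "brown"},
]


def build_index(docs, attributes):
    """Map each tuple of attribute values to the ids of matching documents.

    Hypothetical helper for illustration only.
    """
    index = defaultdict(list)
    for doc in docs:
        # Documents that lack any indexed attribute are left out of the index.
        if all(attr in doc for attr in attributes):
            key = tuple(doc[attr] for attr in attributes)
            index[key].append(doc["id"])
    return index


# Composite index for "search chairs by style and color together".
style_color_index = build_index(documents, ("style", "color"))
# Single-attribute index for "search appliances by power consumption".
power_index = build_index(documents, ("power",))

chair_ids = style_color_index.get(("modern", "brown"), [])
appliance_ids = power_index.get(("1800 watts",), [])
```

      Each lookup is a single dictionary access instead of a scan over every document, which is the benefit an index is meant to provide.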

      Creating a large number of indexes can significantly increase the amount of storage used. In fact, it is not surprising to have total index storage greater than the amount of storage used to store documents. Also, additional indexes can negatively impact performance for insert, update, and delete operations, because the indexes need to be revised to reflect those operations.
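
      The write cost of additional indexes can be made concrete with a small sketch. The class IndexedStore and its index_writes counter are hypothetical and exist only for this illustration; real document databases maintain indexes very differently. The point is only that every insert or delete has to touch each index that covers the document.

```python
class IndexedStore:
    """Toy document store that keeps one dict-based index per indexed attribute."""

    def __init__(self, indexed_attributes):
        self.documents = {}
        self.indexes = {attr: {} for attr in indexed_attributes}
        self.index_writes = 0  # index entries touched by inserts and deletes

    def insert(self, doc):
        self.documents[doc["id"]] = doc
        for attr, index in self.indexes.items():
            if attr in doc:
                # One extra write per index that covers this document.
                index.setdefault(doc[attr], set()).add(doc["id"])
                self.index_writes += 1

    def delete(self, doc_id):
        doc = self.documents.pop(doc_id)
        for attr, index in self.indexes.items():
            if attr in doc:
                # The same indexes must be revised again on delete.
                index[doc[attr]].discard(doc_id)
                self.index_writes += 1


store = IndexedStore(["product_type", "style", "color", "weight"])
store.insert({"id": "987654", "product_type": "chair", "weight": "15 kg",
              "style": "modern", "color": "brown"})
writes_after_insert = store.index_writes
store.delete("987654")
writes_after_delete = store.index_writes
```

      With four indexed attributes, a single insert causes four index writes in addition to storing the document itself, and the delete causes four more.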

      Row Key Access

      Wide-column databases usually take a different approach to querying. Rather than using indexes to allow efficient lookup of rows with needed data, wide-column databases organize data so that rows with similar row keys are close together. Queries use a row key, which is analogous to a primary key in relational databases, to retrieve data. This has two implications.

      Sensor ID   Timestamp    Temperature   Relative humidity   Pressure
      789         1571760690   40            35                  28.2
      790         1571760698   42.5          50                  29.1
      791         1571760676   37            61                  28.6

      Timestamp    Sensor ID   Temperature   Relative humidity   Pressure
      1571760676   791         37            61                  28.6
      1571760690   789         40            35                  28.2
      1571760698   790         42.5          50                  29.1
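
      The effect of row key design on retrieval can be sketched with the sensor readings listed above. This is a minimal in-memory illustration, assuming rows are kept sorted by row key; the helpers build_table and range_scan are hypothetical and do not correspond to any particular database's API. The same readings are stored once with the sensor ID leading the key and once with the timestamp leading the key.

```python
from bisect import bisect_left

# Readings from the tables above:
# (sensor_id, timestamp, temperature, relative_humidity, pressure)
readings = [
    (789, 1571760690, 40.0, 35, 28.2),
    (790, 1571760698, 42.5, 50, 29.1),
    (791, 1571760676, 37.0, 61, 28.6),
]


def build_table(rows, key_func):
    """Return (row_key, row) pairs sorted by row key (hypothetical helper)."""
    return sorted((key_func(row), row) for row in rows)


def range_scan(table, start_key, end_key):
    """Return rows whose keys fall in [start_key, end_key).

    The key list is rebuilt on every call to keep the sketch short;
    a real system would keep the keys sorted on disk.
    """
    keys = [key for key, _ in table]
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, end_key)
    return [row for _, row in table[lo:hi]]


# Sensor ID first: all readings for one sensor sit next to each other.
by_sensor = build_table(readings, lambda r: (r[0], r[1]))
# Timestamp first: readings taken close together in time sit next to each other.
by_time = build_table(readings, lambda r: (r[1], r[0]))

# A time-range query is one contiguous slice of the timestamp-keyed table.
recent = range_scan(by_time, (1571760680,), (1571760700,))
recent_sensor_ids = [row[0] for row in recent]
```

      With the timestamp-leading key, the time-range query reads one contiguous run of rows. With the sensor-leading key, the same query would have to visit rows scattered across the table, which is why the row key should be chosen to match the most common access pattern.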

      Unstructured Data

      The distinguishing characteristic of unstructured data is that it does not have a defined schema or data model. Structured data, like relational database tables, has a fixed data model that is defined before data is added to the table. Semi-structured databases include a schema with each row or document in the database. Examples of unstructured data include the following:

       Text files of natural language content

       Audio files

       Video files

       Binary large objects (BLOBs)

      Google’s Storage Decision Tree

      Schema Design Considerations

      Structured and semi-structured data has a schema associated with it. Structured data is usually stored in relational databases whereas semi-structured data is often stored in NoSQL databases. The schema influences how data is stored and accessed,