Lillian Pierson

Data Science For Dummies


Скачать книгу

store NoSQL database, and MongoDB is the most-popular document-oriented type of NoSQL database. It uses dynamic schemas and stores JSON-esque documents.

      

A document-oriented database is a NoSQL database that houses, retrieves, and manages the JSON files and XML files that you heard about back in Chapter 1, in the definition of semistructured data. A document-oriented database is otherwise known as a document store.

      

Some people argue that the term NoSQL stands for Not Only SQL, and others argue that it represents non-SQL databases. The argument is rather complex and has no cut-and-dried answer. To keep things simple, just think of NoSQL as a class of nonrelational systems that don’t fall within the spectrum of RDBMSs that are queried using SQL.

      Storing big data on-premise

      Although cloud storage and cloud processing of big data is widely accepted as safe, reliable, and cost-effective, companies have a multitude of reasons for using on-premise solutions instead. In many instances of the training and consulting work I’ve done for foreign governments and multinational corporations, cloud data storage was the ultimate “no-fly zone” that should never be breached. This is particularly true of businesses I’ve worked with in the Middle East, where local security concerns were voiced as a main deterrent for moving corporate or government data to the cloud.

      Though the popularity of storing big data on-premise has waned in recent years, many companies have their reasons for not wanting to move to a cloud environment. If you find yourself in circumstances where cloud services aren’t an option, you’ll probably appreciate the following discussion about on-premise alternatives.

      

The Kubernetes and NoSQL databases described earlier in this chapter can be deployed on-premise as well as in a cloud environment.

      Reminiscing about Hadoop

      Because big data’s three Vs (volume, velocity, and variety) don’t allow for the handling of big data using traditional RDMSs, data engineers had to become innovative. To work around the limitations of relational systems, data engineers originally turned to the Hadoop data processing platform to boil down big data into smaller datasets that are more manageable for data scientists to analyze. This was all the rage until about 2015, when market demands had changed to the point that the platform was no longer able to meet them.

      

When people refer to Hadoop, they’re generally referring to an on-premise Hadoop storage environment that includes the HDFS (for data storage), MapReduce (for bulk data processing), Spark (for real-time data processing), and YARN (for resource management).

      Incorporating MapReduce, the HDFS, and YARN

      MapReduce is a parallel distributed processing framework that can process tremendous volumes of data in-batch — where data is collected and then processed as one unit with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples. (With respect to MapReduce, tuples refers to key-value pairs by which data is grouped, sorted, and processed.) In layperson terms, MapReduce uses parallel distributed computing to transform big data into data of a manageable size.

      

In Hadoop, parallel distributed processing refers to a powerful framework in which data is processed quickly via the distribution and parallel processing of tasks across clusters of commodity servers.

      Storing data on the Hadoop distributed file system (HDFS)

      The HDFS uses clusters of commodity hardware for storing data. Hardware in each cluster is connected, and this hardware is composed of commodity servers — low-cost, low-performing generic servers that offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in storing big data.

      The HDFS is characterized by these three key features:

       HDFS blocks: In data storage, a block is a storage unit that contains some maximum number of records. HDFS blocks can store 64 megabytes of data, by default.

       Redundancy: Datasets that are stored in HDFS are broken up and stored on blocks. These blocks are then replicated (three times, by default) and stored on several different servers in the cluster, as backup, or as redundancy.

       Fault-tolerance: As mentioned earlier, a system is described as fault-tolerant if it’s built to continue successful operations despite the failure of one or more of its subcomponents. Because the HDFS has built-in redundancy across multiple servers in a cluster, if one server fails, the system simply retrieves the data from another server.

      Putting it all together on the Hadoop platform

      The Hadoop platform was designed for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN (a resource manager) all working together.

      Within a Hadoop platform, the workloads of applications that run on the HDFS (like MapReduce and Spark) are divided among the nodes of the cluster, and the output is stored on the HDFS. A Hadoop cluster can be composed of thousands of nodes. To keep the costs of input/output (I/O) processes low, MapReduce jobs are performed as close as possible to the data — the task processors are positioned as closely as possible to the outgoing data that needs to be processed. This design facilitates the sharing of computational requirements in big data processing.

      Introducing massively parallel processing (MPP) platforms

      Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional on-premise data warehouse, an MPP may be the perfect solution.

      To understand how MPP compares to a standard MapReduce parallel-processing framework, consider that MPP runs parallel computing tasks on costly custom hardware, whereas MapReduce runs them on inexpensive commodity servers. Consequently, MPP processing capabilities are cost restrictive. MPP is quicker and easier to use than standard MapReduce jobs. That’s because MPP can be queried using Structured Query Language (SQL), but native MapReduce jobs are controlled by the more complicated Java programming language.

      Processing big data in real-time

      A real-time processing framework is — as its name implies — a framework that processes data in real-time (or near-real-time) as the data streams and flows into the system. Real-time frameworks process data in microbatches — they return results in a matter of seconds rather than the hours or days it typically takes batch processing frameworks like MapReduce. Real-time processing frameworks do one of the following:

       Increase the overall time efficiency of the system: Solutions in this category include Apache Storm and Apache Spark for near-real-time stream processing.

       Deploy innovative querying methods to facilitate the real-time querying of big data: Some solutions in this category are Google’s Dremel, Apache Drill, Shark for Apache Hive, and Cloudera’s Impala.