Seifedine Kadry

Big Data



sources are erroneous, incomplete, and inconsistent because of their massive volume and heterogeneous sources, and it is pointless to store useless and dirty data. Additionally, some analytical applications have a crucial requirement for quality data. Hence, for effective, efficient, and accurate data analysis, systematic data preprocessing is essential.

      10  What is data integration? Data integration involves combining data from different sources to give the end users a unified data view.

      11  What is data cleaning? The data‐cleaning process fills in the missing values, corrects the errors and inconsistencies, and removes redundancy in the data to improve the data quality. The larger the heterogeneity of the data sources, the higher the degree of dirtiness. Consequently, more cleaning steps may be involved.

      12  What is data reduction? Processing a massive volume of data may take so long that analysis becomes infeasible or impractical. Data reduction means reducing either the volume of the data or its dimension, that is, the number of attributes. Data reduction techniques make it possible to analyze the data in a reduced form that preserves the integrity of the original data and still yields quality outputs.

      13  What is data transformation? Data transformation refers to transforming or consolidating the data into a format that is accepted by the big data database, and converting it into logical and meaningful information for data management and analysis.
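The four preprocessing steps described in questions 10 to 13 can be sketched as one small pipeline. This is only an illustrative sketch in plain Python: the record layouts, the field names (`id`, `age`, `spend`), and the helper functions are hypothetical, and the specific choices made here (mean imputation for cleaning, dropping attributes for reduction, min-max scaling for transformation) are just one of many possible techniques for each step.

```python
from statistics import mean

# Hypothetical records from two separate sources sharing an "id" key.
crm_rows = [
    {"id": 1, "name": "Ann", "age": 34, "city": "Paris"},
    {"id": 2, "name": "Bob", "age": None, "city": "Rome"},
    {"id": 2, "name": "Bob", "age": None, "city": "Rome"},  # redundant copy
    {"id": 3, "name": "Cy", "age": 50, "city": "Oslo"},
]
web_rows = [
    {"id": 1, "spend": 120.0},
    {"id": 2, "spend": 80.0},
    {"id": 3, "spend": 200.0},
]


def integrate(left, right, key):
    """Data integration: join two sources on a shared key into one view."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key.get(row[key], {})} for row in left]


def clean(rows, key, numeric_field):
    """Data cleaning: remove redundant rows, then fill missing values
    in numeric_field with the mean of the known values."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(dict(row))
    known = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill = mean(known)  # assumes at least one known value exists
    for r in unique:
        if r[numeric_field] is None:
            r[numeric_field] = fill
    return unique


def reduce_dimensions(rows, keep):
    """Data reduction: keep only the attributes needed for analysis."""
    return [{k: r[k] for k in keep} for r in rows]


def transform(rows, field):
    """Data transformation: min-max scale a numeric field to [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values match
    return [{**r, field: (r[field] - lo) / span} for r in rows]


unified = integrate(crm_rows, web_rows, "id")
cleaned = clean(unified, "id", "age")
reduced = reduce_dimensions(cleaned, ["id", "age", "spend"])
prepared = transform(reduced, "spend")
```

Each function takes and returns a list of dictionaries, so the steps can be reordered or swapped out independently. In a real big data setting the same stages would run on a distributed engine rather than on in-memory lists.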

      1  Give some examples of big data. Facebook generates approximately 500 terabytes of data per day, airlines generate about 10 terabytes of sensor data every 30 minutes, and the New York Stock Exchange generates approximately 1 terabyte of data per day. These are all examples of big data.

      2  How is big data analysis useful for organizations? By applying advanced data analytics techniques, organizations can make better decisions, find new business opportunities, compete against business rivals, improve performance and efficiency, and reduce costs.

      CHAPTER OBJECTIVE

      This chapter gives a brief overview of the storage concepts of big data, namely, clusters and file systems. Data replication, which has made big data storage fault tolerant, is explained through its master‐slave and peer‐to‐peer types. The various types of on‐disk storage are described briefly. Scalability techniques, namely, scaling up and scaling out, adopted by various database systems are also reviewed.

      In big data storage architecture, data reaches users through multiple organization data structures. The big data revolution has brought significant improvements to data storage architecture. New tools such as Hadoop, an open‐source framework for storing data on clusters of commodity hardware, have been developed, allowing organizations to store and analyze large volumes of data effectively.

      In modern BI architecture, the raw data stored in Hadoop can be analyzed using MapReduce programs. MapReduce is the programming paradigm of Hadoop, and it is used to write applications that process the massive data stored in Hadoop.
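To make the MapReduce paradigm more concrete, the following is a minimal in-memory sketch of its three stages (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop or any Hadoop API; the function names and the sample input are hypothetical, and a real job would distribute each stage across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data storage", "big data analysis"])
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both stages can in principle run in parallel on different machines, which is what makes the paradigm suitable for data spread across a cluster.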


      2.1.1 Types of Cluster

      Clusters may be configured for various purposes, such as web‐based services or computation‐intensive workloads. Based on their purpose, clusters may be classified into two major types:

       High availability

       Load balancing

      2.1.1.1 High Availability Cluster

      High availability clusters are designed to minimize downtime and provide uninterrupted service when nodes fail. Nodes in a highly available cluster must have access