illustrates the data generated by various sources that were discussed above.
1.6 Different Types of Data
Data may be machine generated or human generated. Human‐generated data are produced when people interact with machines; e‐mails, documents, and Facebook posts are examples. Machine‐generated data are produced by computer applications or hardware devices without active human intervention; examples include data from sensors, disaster warning systems, weather forecasting systems, and satellites. Figure 1.6 shows human‐generated data (social media activity, e‐mails sent, and pictures taken) alongside machine data generated by a satellite.
Figure 1.6 Human‐ and machine‐generated data.
The machine‐generated and human‐generated data can be represented by the following primitive types of big data:
Structured data
Unstructured data
Semi‐structured data
1.6.1 Structured Data
Data that can be stored in a relational database in table format, with rows and columns, are called structured data. Structured data, often generated by business enterprises, exhibit a high degree of organization; they can easily be processed using data mining tools and can be queried and retrieved using the primary key field. Examples of structured data include employee details and financial transactions. Figure 1.7 shows an example of structured data: an employee details table with EmployeeID as the key.
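As a minimal sketch of querying structured data by its primary key, the following Python example uses the standard-library sqlite3 module with an in-memory database. The table layout, column names, and sample rows are assumptions made for illustration; they are not taken from Figure 1.7.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory relational table of employees (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    # employee_id plays the role of the EmployeeID key described in the text.
    conn.execute(
        "CREATE TABLE employee ("
        "employee_id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "department TEXT NOT NULL, "
        "salary REAL NOT NULL)"
    )
    rows = [
        (101, "Asha", "Finance", 52000.0),
        (102, "Ben", "Engineering", 61000.0),
        (103, "Chen", "Engineering", 58000.0),
    ]
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_employee(conn, employee_id):
    """Retrieve one row by primary key; returns None when no row matches."""
    cursor = conn.execute(
        "SELECT employee_id, name, department, salary "
        "FROM employee WHERE employee_id = ?",
        (employee_id,),
    )
    return cursor.fetchone()


if __name__ == "__main__":
    connection = build_employee_db()
    print(find_employee(connection, 102))
```

Because every row has the same fixed columns and a unique key, a single lookup is enough to retrieve a complete record, which is what makes structured data straightforward to query.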
1.6.2 Unstructured Data
Data that are raw and unorganized and that do not fit into relational database systems are called unstructured data. Nearly 80% of the data generated are unstructured. Examples of unstructured data include video, audio, images, e‐mails, text files, and social media posts. Unstructured data usually reside in either text files or binary files. Data that reside in binary files, for example, audio, video, and images, do not have any identifiable internal structure. Data that reside in text files include e‐mails, social media posts, PDF files, and word processing documents. Figure 1.8 shows unstructured data, the result of a Google search.
Figure 1.7 Structured data—employee details of an organization.
Figure 1.8 Unstructured data—the result of a Google search.
1.6.3 Semi‐Structured Data
Semi‐structured data have a structure but do not fit into a relational database. Because they are organized, they are easier to analyze than unstructured data. JSON and XML are examples of semi‐structured data formats. Figure 1.9 is an XML file that represents the details of an employee in an organization.
Figure 1.9 XML file with employee details.
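The following Python sketch illustrates why semi‐structured data are organized yet do not map directly onto a fixed table. It parses a small XML document with the standard-library xml.etree.ElementTree module and converts the records to JSON. The XML content, element names, and the helper parse_employees are hypothetical and are not a reproduction of Figure 1.9.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical employee records; note that the second record carries an
# extra <email> field that the first one lacks.
EMPLOYEE_XML = """
<employees>
  <employee id="101">
    <name>Asha</name>
    <department>Finance</department>
  </employee>
  <employee id="102">
    <name>Ben</name>
    <department>Engineering</department>
    <email>ben@example.com</email>
  </employee>
</employees>
"""


def parse_employees(xml_text):
    """Turn each <employee> element into a dict keyed by its child tag names."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for employee in root.findall("employee"):
        record = {"id": employee.get("id")}
        for child in employee:
            record[child.tag] = child.text
        records.append(record)
    return records


if __name__ == "__main__":
    parsed = parse_employees(EMPLOYEE_XML)
    # The same records expressed as JSON, another semi-structured format.
    print(json.dumps(parsed, indent=2))
```

The tags describe each value, so the data can be processed programmatically, but records are free to differ in the fields they contain, which is why such data do not fit a fixed set of table columns without extra work.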
1.7 Big Data Infrastructure
The core components of big data technologies are the tools and technologies that provide the capacity to store, process, and analyze the data. Storing data in tables became inadequate as data evolved along the 3 Vs, namely volume, velocity, and variety. The robust RDBMS was no longer cost effective: scaling an RDBMS to store and process huge amounts of data became expensive. This led to the emergence of new technologies that are highly scalable at very low cost.
The key technologies include
Hadoop
HDFS
MapReduce
Hadoop – Apache Hadoop, written in Java, is an open‐source framework that supports the processing of large data sets. It can store a large volume of structured, semi‐structured, and unstructured data in a distributed file system and process the data in parallel. It is a highly scalable and cost‐effective storage platform. Scalability of Hadoop refers to its capability to sustain its performance under rapidly increasing loads by adding more nodes. Hadoop files are written once and read many times. The contents of the files cannot be changed. A large number of interconnected computers working together as a single system is called a cluster. Hadoop clusters are designed to store and analyze massive amounts of disparate data in distributed computing environments in a cost‐effective manner.
Hadoop Distributed File System – HDFS is designed to store large data sets with a streaming access pattern, running on low‐cost commodity hardware. It does not require highly reliable, expensive hardware. The data set is generated from multiple sources and stored in HDFS in a write‐once, read‐many‐times pattern, and analyses are performed on the data set to extract knowledge from it.
MapReduce – MapReduce is the batch‐processing programming model for the Hadoop framework, which adopts a divide‐and‐conquer principle. It is highly scalable, reliable, and fault tolerant, and it can process input data of any format in parallel in distributed computing environments, although it supports only batch workloads. It reduces processing time significantly compared to the traditional batch‐processing paradigm: the traditional approach moved the data from the storage platform to the processing platform, whereas MapReduce runs the processing within the framework where the data actually reside.
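The divide‐and‐conquer idea behind MapReduce can be sketched in plain Python with a word count. This is a single-process illustration of the programming model only, not the Hadoop API: the function names map_phase, shuffle_phase, reduce_phase, and run_job are hypothetical, and the intermediate grouping step is written out explicitly here although the text above does not describe it. In a real cluster, the map tasks would run in parallel on the nodes that hold each input split.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by key so each key is reduced exactly once."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, group the results, then reduce each group."""
    mapped = []
    for split in splits:  # each split could be handled by a separate node
        mapped.extend(map_phase(split))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big", "data tools"]))
    # prints {'big': 2, 'data': 2, 'tools': 1}
```

Each call to map_phase depends only on its own split, which is what allows the work to be divided across machines; only the small intermediate pairs need to be brought together for the reduce step.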
1.8 Big Data Life Cycle
Big data yields big benefits, from innovative business ideas to unconventional ways to treat diseases, once the associated challenges are overcome. The challenges arise because so much data is collected by today's technology. Big data technologies are capable of capturing and analyzing this data effectively. Big data infrastructure involves new computing models capable of both distributed and parallel computation, with highly scalable storage and performance. Some of the big data components include Hadoop (framework), HDFS (storage), and MapReduce (processing).
Figure 1.10 illustrates the big data life cycle. Data arriving at high velocity from multiple sources in different data formats are captured. The captured data is stored in a storage platform such as HDFS or a NoSQL database and then preprocessed to make it suitable for analysis. The preprocessed data held in the storage platform is then passed to the analytics layer, where it is processed using big data tools such as MapReduce and YARN, and analysis is performed on the processed data to uncover hidden knowledge. Analytics and machine learning are important concepts