Seifedine Kadry

Big Data



illustrates the data generated by various sources that were discussed above.

image

      The machine‐generated and human‐generated data can be represented by the following primitive types of big data:

       Structured data

       Unstructured data

       Semi‐structured data
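
      The three types listed above can be illustrated with a minimal sketch. All sample records, names, and the helper field_names are hypothetical and serve only to contrast the shapes of the data; they do not come from the book.

```python
import csv
import io
import json

# Structured: fixed columns, so every row has the same fields (hypothetical sample).
structured = io.StringIO("id,name,age\n1,Asha,34\n2,Ben,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing keys, but records need not share the same fields.
semi_structured = [
    json.loads('{"id": 1, "name": "Asha", "tags": ["vip"]}'),
    json.loads('{"id": 2, "email": "ben@example.com"}'),
]

# Unstructured: free text with no predefined fields.
unstructured = "Asha called on Monday about a delayed order and asked for a refund."


def field_names(record):
    """Return the sorted field names of a dict-like record."""
    return sorted(record.keys())


print([field_names(r) for r in rows])
print([field_names(r) for r in semi_structured])
print(len(unstructured.split()), "words, no fields")
```

      In this sketch the structured rows all expose the same fields, the semi‐structured records each carry their own set of keys, and the unstructured text has no fields at all until some analysis imposes them.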

      1.6.1 Structured Data

      1.6.2 Unstructured Data


      1.6.3 Semi‐Structured Data

      The core components of big data technologies are the tools and technologies that provide the capacity to store, process, and analyze data. Storing data in tables became inadequate as data evolved along the 3 Vs, namely volume, velocity, and variety. The once robust RDBMS was no longer cost effective: scaling an RDBMS to store and process huge amounts of data became expensive. This led to the emergence of new technology that was highly scalable at very low cost.

      The key technologies include:

       Hadoop

       HDFS

       MapReduce

      Hadoop – Apache Hadoop, written in Java, is an open‐source framework that supports the processing of large data sets. It can store a large volume of structured, semi‐structured, and unstructured data in a distributed file system and process it in parallel. It is a highly scalable and cost‐effective storage platform. Scalability here refers to Hadoop's capability to sustain its performance under rapidly increasing loads by adding more nodes. Hadoop files are written once and read many times; the contents of a file cannot be changed once written. A large number of interconnected computers working together as a single system is called a cluster. Hadoop clusters are designed to store and analyze massive amounts of disparate data in distributed computing environments in a cost‐effective manner.

      Hadoop Distributed File System (HDFS) – HDFS is designed to store large data sets with a streaming access pattern and runs on low‐cost commodity hardware. It does not require highly reliable, expensive hardware. A data set is generated from multiple sources and stored in HDFS in a write‐once, read‐many‐times pattern, and analyses are then performed on it to extract knowledge.
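
      The write‐once, read‐many‐times pattern can be sketched with a toy in‐memory store. This is an illustration of the access pattern only, not the HDFS API; the class WriteOnceStore, its methods, and the sample path are hypothetical.

```python
class WriteOnceStore:
    """Toy in-memory store: each path is written once, then read many times.

    Hypothetical illustration of the access pattern; not the HDFS API.
    """

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        # Reject any attempt to overwrite a path that already holds data.
        if path in self._files:
            raise FileExistsError(f"{path} already written; contents cannot be changed")
        self._files[path] = bytes(data)

    def read(self, path):
        # Reads are unrestricted and always return the originally written bytes.
        return self._files[path]


store = WriteOnceStore()
store.write("/logs/day1.txt", b"sensor=7 temp=21\n")

# The same file can be read any number of times.
first = store.read("/logs/day1.txt")
second = store.read("/logs/day1.txt")

# A second write to the same path is refused.
try:
    store.write("/logs/day1.txt", b"overwritten")
    overwritten = True
except FileExistsError:
    overwritten = False
```

      Refusing in‐place changes keeps the model simple: every reader sees the same bytes, so repeated analyses over the stored data set stay consistent.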

      Big data yields big benefits, from innovative business ideas to unconventional ways of treating diseases, once the associated challenges are overcome. The challenges arise because so much data is collected by today's technology. Big data technologies are capable of capturing and analyzing this data effectively. Big data infrastructure involves new computing models capable of both distributed and parallel computation, with highly scalable storage and performance. Big data components include Hadoop (framework), HDFS (storage), and MapReduce (processing).
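
      To give a feel for the processing component, the following is a minimal single‐process sketch of the MapReduce programming model, using word counting as the example task. It is not Hadoop code: the function names and the sample lines are hypothetical, and in a real cluster the map and reduce steps would run as parallel tasks on different nodes rather than in one Python process.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, standing in for lines read from files in distributed storage.
lines = ["big data big benefits", "data in HDFS"]
counts = word_count(lines)
print(counts)
```

      Because each line is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is what makes the model suited to a cluster.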