Группа авторов

Computational Statistics in Data Science


Скачать книгу

issues to adapt to the high‐volume and complexity in data. While data streams provide the opportunity for machine learning algorithms to uncover useful and interesting patterns, traditional machine learning algorithms face the challenge of scalability to truly uncover the hidden value in the data stream [26].

      3.2 Integration

      Building a distributed framework with every node having a data stream flow view implies that each node is liable for performing analysis with few sources. Aggregating these views to build a complete view is inconsequential. This calls for the development of an integration technique that can perform efficient operations through disparate datasets [27].

      3.3 Fault‐Tolerance

      For life‐critical systems, high fault‐tolerance is required. In streaming computing environments, where unbounded data are generated in real‐time, an amazing high adaptation to noncritical failure procedure and scalable system, is required to allow an application to keep working without interruption despite component failure. The most widely recognized adaptation to internal failure is checkpointing, where the framework state is intermittently persisted to recapture the computational state after system failures. However, the overhead incurred with checkpointing can negatively affect system performance. An improved checkpointing to minimize the overhead cost was proposed by [28, 29].

      3.4 Timeliness

      Time is essential for time‐sensitive processes, which incorporate foiling fraud, mitigating security threats, or responding to a natural disaster. Such architectures or platforms must be scalable to enable consistent handling of data streams [30]. The fundamental challenge bothers on implementing a distributed architecture for data aggregation with insignificant latency between the communicating nodes.

      3.5 Consistency

      Achieving high consistency or stability in the data stream computing environments is nontrivial as it is hard to figure out which data are required and which nodes ought to be consistent [31, 32]. Thus, a good framework is required.

      3.6 Heterogeneity and Incompleteness

      Data streams are heterogeneous in structure, semantics, organizations, granularity, and accessibility. Different data in disparate sources, different formats, combined with the volume of data, make the integration, retrieval, and reasoning over the data stream a challenging task [33]. The challenge here is how to deal with ever‐growing data, extract, aggregate, and correlate data streams from numerous sources in real‐time. There is need to design a competent data presentation to mirror the structure, hierarchy, and diversity of data streams.

      3.7 Load Balancing

      3.8 High Throughput

      Decision concerning the identification of the data stream portion that needs replication, number of these replicas that is required, and which of the data stream to assign to each replica is an issue in data stream computing environment. Proper multiple instances replication is required if high throughput is to be achieved [35].

      3.9 Privacy

      Data stream analytics open doors for real‐time analysis of massive amount of data but also made a colossal danger to individual privacy [36]. As indicated by the International Data Cooperation (IDC), half of the aggregate data that needs protection is not adequately protected. Relevant and efficient privacy‐preserving solutions for interpretation, observation, evaluation, and decision for data stream mining should be designed [37]. The sensitive nature of data necessitates privacy‐preserving techniques. One of the leading privacy‐preserving techniques is perturbation [38].

      3.10 Accuracy

      Developing efficient methods that can precisely predict future observations is one of the leading goals of data stream analysis. Yet, the intrinsic features of data stream, which include noisy characteristics, velocity, volume, variety, value, veracity, variability, and volatility, data stream analysis strongly constrain processing algorithms spatiotemporally. Hence, to guarantee high accuracy, mitigation of these challenges must be taken into consideration as they can negatively influence the accuracy of data stream analysis [39].

      Data stream pre‐processing, which aims at reducing the inherent complexity associated with streaming data for a faster, more understandable, and interpretable, and more precise learning process is an essential technique in knowledge discovery. However, despite the recorded growth in online learning, data stream pre‐processing methods still have a long way to go due to the high level of noise [66]. These noisy terms incorporate a short length of messages, slangs, abbreviations, acronyms, blended dialects, linguistic and spelling mistakes, sporadic, casual, shortened words, and ill‐advised sentence structure, which make it hard for learning algorithms to perform productively and adequately [67]. Additionally, error from sensor reading due to low battery, damage, incorrect calibrations, among others, can render data delivered from such sensors unsuitable for analysis [68].

      Data quality is a fundamental determinant in the knowledge discovery pipeline as low‐quality data yields low‐quality models and choices [69]. There is need to strengthen data stream pre‐processing stage in the face of multi‐label [70], imbalance [71], and multi‐instance [72] problems associated data stream [66]. Also, data stream pre‐processing techniques with low computational requirement [73] needs to be evolved as this is still open for research. Moreover, the representation of social media posts must be in a way that the semantics of social media content is preserved [74, 75]. To improve the result of analysis in the data stream, there is need to develop frameworks that will cope with the noisy characteristics, redundancy, heterogeneity, data imbalance, transformation, feature representation, or selection issues in data streams [26]. Some of the new frameworks developed for pre‐processing and enriching data stream for better results are SlangSD [76], N‐gram and Hidden Markov Model [77], SLANGZY [78], and SMFP [67].