
Computational Statistics in Data Science

of the constrained measure of labeled data accessible for building the models [94].

Some of the supervised learning algorithms for streaming scenarios are grouped in [95] as follows: (i) Tree‐based algorithms: OLIN, Ultra‐Fast Forest of Trees (UFFT), Very Fast Decision Tree learner (VFDT), VFDTc, Random Forest, Vertical Hoeffding Tree, and Concept‐adapting Evolutionary Algorithm for Decision Tree (CEVOT); (ii) Rule‐based algorithms: On‐demand classifier, Fuzzy Passive‐aggressive classification, Similarity‐based data stream classification (SimC), Prequential area under curve (AUC) based classifier, one‐class classifier with incremental learning and forgetting, and Classifying recurring concept using fuzzy similarity function; (iii) Ensemble‐based algorithms: Streaming ensemble algorithm, Weighted classifier ensemble, and Distance‐based ensemble online classifier with kernel clustering; (iv) Nearest‐neighbor: Adaptive nearest neighbor classification algorithm and anytime nearest neighbor algorithm; (v) Statistical: Evolving Naïve Bayes; (vi) Deep learning: Activity recognition [96].

6.4 Ontology‐Based Methods

Performing streaming data analysis over ontologies and linked open data is a challenging and emerging research area. Semantic web technology, an extension of the World Wide Web, is used to improve the interoperability of heterogeneous sources with a data model called Resource Description Framework (RDF) and ontological languages such as Web Ontology Language (OWL). Works that apply ontologies or linked open data to data streams include [97–99]. Because data streams are dynamic, current solutions for reasoning over the data model and ontological languages are not suited to the streaming context. This gap gave rise to what is referred to as stream reasoning. Stream reasoning is the set of inference approaches and deduction mechanisms concerned with providing continuous inference over a data stream, leading to a better decision support system [100]. Stream reasoning has been applied in remote health monitoring [101], smart cities [102], semantic analysis of social media [103], and maritime safety and security [104], amongst others. Another attempt to improve semantic web ontology is to lift existing streams to RDF streams using intuitive configuration mechanisms. Some of the techniques for RDF stream modeling include Semantic Sensor Network (SSN) ontology [105], Stream Annotation Ontology (SAO) [106], smart appliances reference (SAREF) ontology [107], and Linked Stream Annotation Engine (LSane) [108].
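As a rough illustration of what lifting a raw stream to an RDF stream can look like, the sketch below turns sensor readings into timestamped groups of triples. It is a minimal sketch under stated assumptions, not the method of any cited system: the function names `lift_reading` and `lift_stream`, the reading fields, and the `http://example.org/` namespace are hypothetical, literals are kept as plain strings for brevity, and the property names are taken from the W3C SOSA vocabulary, the lightweight core of SSN.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator, List, Tuple

SOSA = "http://www.w3.org/ns/sosa/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

Triple = Tuple[str, str, str]


def lift_reading(reading: Dict, index: int) -> Tuple[float, List[Triple]]:
    """Turn one raw reading into a timestamped group of triples.

    `reading` is assumed to have the keys 'sensor_id', 'timestamp'
    (seconds since the Unix epoch), and 'value'.
    """
    observation = f"{EX}observation/{index}"
    timestamp = reading["timestamp"]
    iso_time = datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()
    triples = [
        (observation, RDF_TYPE, SOSA + "Observation"),
        (observation, SOSA + "madeBySensor", f"{EX}sensor/{reading['sensor_id']}"),
        # Literals are simplified to plain strings in this sketch.
        (observation, SOSA + "hasSimpleResult", str(reading["value"])),
        (observation, SOSA + "resultTime", iso_time),
    ]
    return timestamp, triples


def lift_stream(readings: Iterable[Dict]) -> Iterator[Tuple[float, List[Triple]]]:
    """Lazily lift readings one at a time, so the input may be unbounded."""
    for index, reading in enumerate(readings):
        yield lift_reading(reading, index)
```

Each yielded element pairs a timestamp with the triples describing one observation, which is the general shape that stream reasoners consume; a real deployment would use an RDF library and typed literals rather than string tuples.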

7 Strategies for Processing Data Streams

Data stream processing includes techniques, models, and systems for processing data as soon as they arrive, in order to detect trends and patterns with low latency [109]. It requires both storage capacity and computational power in the face of unbounded data generated at high velocity and with a brief life span. To cope with these requirements, approximate computing, which aims at low latency at the expense of an acceptable loss of quality, has been a practical solution [110]. The idea behind approximate computing is to return an approximate answer to a user query instead of the exact answer. This is done by choosing a representative sample of the data instead of the whole data [111]. The two main techniques for approximate computing are (i) sampling [4], which constructs data stream summaries by probabilistic selection, and (ii) sketches [112], which compress data using data structures (such as histograms or hash tables), prediction‐based methods (such as Bayesian inference), and transformation‐based methods (such as wavelets).
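The sampling idea can be illustrated with reservoir sampling, a standard way to keep a fixed-size uniform sample of a stream whose length is not known in advance. The sketch below is one possible instance of sampling, not necessarily the scheme used in [4]; the function names `reservoir_sample` and `approximate_mean` are illustrative.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(stream: Iterable[float], k: int,
                     seed: Optional[int] = None) -> List[float]:
    """Keep a uniform random sample of at most k items using O(k) memory."""
    rng = random.Random(seed)
    reservoir: List[float] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def approximate_mean(stream: Iterable[float], k: int,
                     seed: Optional[int] = None) -> float:
    """Answer a mean query from the sample instead of the whole stream."""
    sample = reservoir_sample(stream, k, seed)
    return sum(sample) / len(sample) if sample else float("nan")
```

For example, `approximate_mean(range(1_000_000), k=1000)` estimates the stream mean while storing only 1000 values; the estimate varies from run to run unless `seed` is fixed, which is the quality loss that approximate computing accepts in exchange for bounded memory.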

Fixed windows and sliding windows are two computation models for partitioning a data stream. A fixed window partitions the data stream into nonoverlapping time segments; the current data are removed after processing, which resets the window size to zero. A sliding window holds a historical snapshot of the data stream at any point in time. When the arriving data are at variance with the current window elements, the tuples are updated by discarding the oldest data [5]. Sliding windows can be further subdivided into count‐based and time‐based windows. In a count‐based window, the progressive step is expressed in tuple counts, whereas in a time‐based window, items with the oldest timestamps are replaced by items with the latest timestamps [113].
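The window models described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the fixed window is delimited by item count rather than by time to keep the example short, the count‐based sliding window advances one tuple at a time, and the time‐based window assumes events arrive in non‐decreasing timestamp order. All function names are illustrative.

```python
from collections import deque
from typing import Any, Iterable, Iterator, List, Tuple


def fixed_windows(stream: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Nonoverlapping windows: emit, then reset the window to empty."""
    window: List[Any] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []  # the processed data are dropped
    if window:
        yield window  # trailing partial window


def count_sliding_windows(stream: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Count-based sliding window that advances by one tuple per arrival."""
    window: deque = deque(maxlen=size)  # a full deque discards its oldest item
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield list(window)


def time_sliding_windows(events: Iterable[Tuple[float, Any]],
                         span: float) -> Iterator[List[Any]]:
    """Time-based sliding window over (timestamp, value) pairs.

    After each arrival at time ts, the window holds the values whose
    timestamps lie in the interval (ts - span, ts]. Requires span > 0.
    """
    window: deque = deque()
    for ts, value in events:
        window.append((ts, value))
        while window[0][0] <= ts - span:
            window.popleft()  # evict items with the oldest timestamps
        yield [v for _, v in window]
```

For instance, `list(count_sliding_windows(range(5), 3))` yields three overlapping windows, whereas `list(fixed_windows(range(5), 2))` yields disjoint ones plus a final partial window.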

8 Best Practices for Managing Data Streams

A data stream is so dynamic that dealing with data in motion is not only a design‐time problem but also a run‐time problem whose operations must be managed in real time. Stream computing has emerged as a capability of real‐time applications in smart cities, monitoring systems, manufacturing, and financial markets [15]. Data stream management systems should be able to update the answers to continuous queries as new data arrive. Choosing the right processing model for streaming data is challenging, given the growing number of frameworks with varied and overlapping services [114]. When a high volume of data from disparate sources must be processed within a short time interval, Storm and Flink may be considered. For pure stream processing, Storm is recommended for highly stream‐oriented applications, as it can process millions of events per second. When it comes to durability, scalability, high‐throughput, and low‐latency capabilities, Apache Kafka is a good option [115]. Yahoo! S4 has capabilities for real‐time response, fault‐tolerance, and scalability [116]. The Spark framework may be suitable for periodic processing tasks such as fraud detection, web usage mining, and so on. For tasks that combine both batch and streaming programming models, such as IoT and healthcare, Spark and Flink may be good candidates [117]. Some of the frameworks that support iterative processing or machine learning tasks are Flink (FlinkML), Spark (Spark MLlib), GraphX with Spark, and Gelly with Flink. Other graph processing frameworks include BLADYG, GraphLab, and Trinity.

IBM InfoSphere Streams can handle millions of messages or events per second with high throughput rates, making it one of the leading proprietary solutions for real‐time applications [61]. Apama Stream Analytics is suitable for real‐time and high‐volume business operations [62]. Azure Stream Analytics is another proprietary solution for driving streaming analytics and IoT goals [62]. Other reasonable proprietary solutions include Kinesis, PieSync, TIBCO Spotfire, Google Cloud Pub/Sub, Azure Event Hubs, Amazon Elasticsearch Service, and Kibana.

In the ideal case, choosing a single streaming data technology that supports all the system requirements, such as the state of data, the use case, and the kind of results, seems best, as this avoids interoperability problems.

9 Conclusion and the Way Forward

In this chapter, we have considered cutting‐edge issues concerning data streams, or streaming data. Interest in stream processing is increasing, and data must be handled quickly to make decisions in real time. The key presumption of stream computing is that the value of data lies in its freshness. Thus, data are analyzed the moment they arrive in a stream, unlike batch processing, where data are first stored before they are explored. Challenges for data stream analysis include concept drift, scalability, integration, fault tolerance, timeliness, consistency, heterogeneity and incompleteness, load balancing, privacy issues, and accuracy [27, 28, 30–32, 34, 35], which emerge from the nature of data streams.

Streaming is an active research area. However, some aspects of streaming have still received little attention. One of them is transactional guarantees. Current stream processing systems can provide basic guarantees, such as processing each data point in the stream exactly once or at least once, but cannot provide guarantees that span multiple operations or stream elements. Another area in which to intensify research effort is data stream pre‐processing. Data quality is a vital determinant in the knowledge discovery pipeline, as low‐quality data yield low‐quality models and decisions [69]. There is a need to reinforce the data stream pre‐processing stage [67] in the face of the multi‐label [70], imbalance [71], and multi‐instance [72] problems associated with data streams [66]. Also, the representation of social media posts must be such that the semantics of social media content is preserved [74, 75]. Moreover, data stream pre‐processing techniques with low computational requirement [73] need to be
