Group of authors

Computational Statistics in Data Science


      Streaming data mining tasks include clustering, similarity search, prediction, classification, and object detection, among others [82, 83]. Algorithms used for streaming data analysis can be grouped into four categories: unsupervised learning, semi‐supervised learning, supervised learning, and ontology‐based techniques. Each is described below.

      6.1 Unsupervised Learning

      Unsupervised learning is a type of learning that draws inferences from unlabeled datasets [84]. A data stream source is nonstationary, and clustering algorithms have no advance information about the data distribution [85]. Because several iterations are required to compute similarity or dissimilarity over the observed dataset, the entire dataset must in most cases be available in memory before the algorithm runs. With data stream clustering, however, the challenge is to search for new structure in the data as it evolves, characterizing the streaming data as clusters and leveraging those clusters to report useful and interesting patterns in the data stream [86]. Unsupervised learning algorithms are well suited to data stream analysis because they do not require predefined labels [87]. Clusters are ranked by a scoring function based, for example, on keywords, hashtags, the semantic relationship of terms, or segment extraction [88].

      Data stream clustering can be grouped into five categories, which are partitioning methods, hierarchical methods, model‐based methods, density‐based methods, and grid‐based methods.

      Partition‐based techniques attempt to find k partitions based on some distance measure. Classical partitioning clustering methods are not directly suitable for streaming scenarios, since they require the number of clusters to be known in advance. Examples of partition‐based methods include Incremental K‐Mean, STREAMKM++, Stream LSearch, HPStream, SWClustering, and CluStream.
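The one‐pass update at the heart of incremental k‐means can be sketched in a few lines. This is a minimal illustration, not any of the named algorithms above: each arriving point is assigned to its nearest centroid, which then moves toward the point with a running‐mean step, so no points need to be kept in memory.

```python
def incremental_kmeans(stream, centroids):
    """Single pass over a stream: assign each point to its nearest centroid
    and move that centroid toward the point by a running-mean update."""
    counts = [0] * len(centroids)
    for x in stream:
        # nearest centroid by squared Euclidean distance
        j = min(range(len(centroids)),
                key=lambda i: sum((c - a) ** 2 for c, a in zip(centroids[i], x)))
        counts[j] += 1
        eta = 1.0 / counts[j]  # running-mean step size
        centroids[j] = [c + eta * (a - c) for c, a in zip(centroids[j], x)]
    return centroids
```

With points arriving near (0, 0) and (5, 5) and two seed centroids, the centroids settle on the running means of the points assigned to them.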

      Hierarchical methods can be further subdivided into divisive and agglomerative. In divisive hierarchical clustering, a cluster is repeatedly split into smaller clusters until it cannot be split further. In contrast, agglomerative hierarchical clustering merges separate clusters until the distance between two clusters reaches a required threshold. Balanced iterative reducing and clustering using hierarchies (BIRCH), open distributed application construction (ODAC), E‐Stream, clustering using representatives (CURE), and HUE‐ are some hierarchical algorithms for data stream analysis.
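The agglomerative merge‐until‐threshold loop described above can be sketched as follows. This is a generic single‐linkage illustration under simplifying assumptions (small batch, Euclidean distance), not an implementation of BIRCH, CURE, or the other named algorithms.

```python
def agglomerative(points, threshold):
    """Merge the two closest clusters (single linkage) until the minimum
    inter-cluster distance exceeds the threshold."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # single linkage: distance between the closest pair of members
        return min(sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        if dist(clusters[i], clusters[j]) > threshold:
            break  # remaining clusters are farther apart than the threshold
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Two tight groups of points separated by a wide gap are merged internally but never across the gap.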

      In model‐based methods, a hypothesized model is run for each cluster to check which data properly fits a cluster. Some of the algorithms that fit into this category are CluDistream, Similarity Histogram‐based Incremental Clustering, sliding window with expectation maximization (SWEM), COBWEB, and Evolving Fractal‐Based Clustering of Data Streams.
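The "check which data fit a hypothesized model" idea can be illustrated with the simplest possible case: score a point under a Gaussian model per cluster and keep the best fit. This is a toy sketch of the model‐based principle, not any of the listed algorithms, and the 1‑D Gaussian assumption is mine.

```python
import math

def best_model(x, models):
    """Score a 1-D point under each hypothesized Gaussian cluster model
    (mean, std) and return the index of the best-fitting one."""
    def log_lik(x, mu, sigma):
        # log density of N(mu, sigma^2) at x
        return (-math.log(sigma * math.sqrt(2 * math.pi))
                - (x - mu) ** 2 / (2 * sigma ** 2))
    return max(range(len(models)), key=lambda i: log_lik(x, *models[i]))
```

A point near 0 fits the first model below; a point near 10 fits the second.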

      6.2 Semi‐Supervised Learning

      Semi‐supervised learning belongs to a class of AI frameworks that train on a combination of unlabeled and labeled data [89]. Semi‐supervised learning in a data stream context is challenging because data are generated in real time and labels may be missing due to factors such as communication errors, network delays, and expensive labeling processes [90]. According to Zhu and Li [91], a semi‐supervised learning problem in a data stream context is defined as follows. Let S = {(xt, yt) : t = 1, …, T0} denote the streaming data in the first T0 time period. Let Y = {1, 2, …, K} be the known label set. Each arriving instance xt carries a label yt ∈ {−1, 1, 2, …, K}. If yt = −1, xt is an unlabeled instance, but its true label lies in Y. As time goes on, evolution happens: the subsequent stream S′ = {(xt, yt) : t = T0 + 1, …, ∞} may contain novel classes. That is, there exists (xt′, yt′) ∈ S′ with yt′ = −1 whose true label is not in Y. Note that whenever yt′ ≠ −1, yt′ ∈ Y always holds.

      Semi‐supervised learning on streaming data may return results similar to those of the supervised approach. However, two observations hold for semi‐supervised learning on streaming data: (i) to stabilize the classifiers, considerably more objects ought to be labeled, and (ii) a larger confidence threshold adversely impacts the strength of the classifiers as the standard deviation increases [19]. Some of the semi‐supervised learning techniques for data streams include ensemble techniques, graph‐based methods, deep learning, active learning, and linear neighborhood propagation.
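The role of the confidence threshold noted above can be made concrete with a self‐training sketch: a simple learner labels the unlabeled instances it is confident about (margin above a threshold) and retrains on them. This is a minimal 1‑D nearest‐centroid illustration under my own simplifying assumptions, not one of the techniques listed.

```python
def self_train(labeled, unlabeled, threshold, rounds=5):
    """Self-training sketch: a nearest-centroid learner adopts its own
    confident predictions (distance margin above threshold) as labels."""
    def centroids(data):
        sums = {}
        for x, y in data:
            s, n = sums.get(y, (0.0, 0))
            sums[y] = (s + x, n + 1)
        return {y: s / n for y, (s, n) in sums.items()}

    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cs = centroids(labeled)
        newly, rest = [], []
        for x in pool:
            d = sorted((abs(x - c), y) for y, c in cs.items())
            # confident only if the runner-up class is much farther away
            if len(d) > 1 and d[1][0] - d[0][0] > threshold:
                newly.append((x, d[0][1]))
            else:
                rest.append(x)
        if not newly:
            break
        labeled += newly
        pool = rest
    return centroids(labeled)
```

A larger threshold leaves more instances unlabeled per round, which mirrors observation (ii): confidence is traded against the amount of self-labeled data.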

      6.3 Supervised Learning

      Supervised learning is the type of machine learning that infers a function from labeled training data. Each training example consists of a pair of an input vector and an output (supervisory signal). Let the data stream S = {…, dt − 1, dt, dt + 1, …}, where dt = (xi, yi), xi is the attribute-value vector of the ith instance, and yi is its class. Data stream classification aims to train a classifier f : x → y that establishes a mapping relationship between feature vectors and class labels [92].
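A classifier f : x → y trained incrementally over a stream can be sketched with the classic online perceptron. This is a generic illustration of stream classification, assuming binary labels in {−1, +1}; it is not a method singled out by the text.

```python
def train_perceptron(stream, dim, lr=1.0):
    """Online perceptron over a stream of (x, y) pairs with y in {-1, +1}:
    the weights are updated only when the current prediction is wrong."""
    w, b = [0.0] * dim, 0.0
    for x, y in stream:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
        if pred != y:
            # mistake-driven update: nudge the hyperplane toward x
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b

def predict(w, b, x):
    """The learned mapping f : x -> y."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

On a linearly separable stream the learned f reproduces the class labels after a few mistakes.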

      Supervised learning approaches can be subdivided into two major categories, which are regression and classification. When the class attribute is continuous, it is called regression, but when the class attribute is discrete, it is referred to as classification. Manual labeling is difficult, time‐consuming, and could be very costly [93]. In a streaming scenario with high velocity and volume,