Group of authors

Computational Statistics in Data Science


      Streaming data mining tasks include clustering, similarity search, prediction, classification, and object detection, among others [82, 83]. Algorithms used for streaming data analysis can be grouped into four categories: unsupervised learning, semi‐supervised learning, supervised learning, and ontology‐based techniques. Each is described below.

      6.1 Unsupervised Learning

      Unsupervised learning is a type of learning that draws inferences from unlabeled datasets [84]. A data stream source is nonstationary, and clustering algorithms have no advance information about the data distribution [85]. Because several iterations are required to compute similarity or dissimilarity over the observed dataset, the entire dataset must in most cases be available in memory before the algorithm runs. With data stream clustering, however, the challenge is to search for new structure in the data as it evolves, characterizing the streaming data as clusters and leveraging those clusters to report useful and interesting patterns in the data stream [86]. Unsupervised learning algorithms are well suited to data stream analysis because they do not require predefined labels [87]. Clusters are ranked by a scoring function based, for example, on keywords, hashtags, the semantic relationship of terms, or segment extraction [88].

      Data stream clustering can be grouped into five categories, which are partitioning methods, hierarchical methods, model‐based methods, density‐based methods, and grid‐based methods.

      Partition‐based techniques attempt to find k partitions based on some distance measure. Classical partitioning clustering methods are not directly suitable for streaming scenarios, since they require the number of clusters to be known in advance. Examples of partition‐based methods include Incremental K‐Mean, STREAMKM++, Stream LSearch, HPStream, SWClustering, and CluStream.
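The one‐pass update at the heart of incremental k‐means can be sketched in a few lines. This is a minimal illustration, not any of the named algorithms above: each arriving point is assigned to its nearest centroid, which then moves toward the point with a running‐mean step, so no points need to be kept in memory.

```python
def incremental_kmeans(stream, centroids):
    """Single pass over a stream: assign each point to its nearest centroid
    and move that centroid toward the point by a running-mean update."""
    counts = [0] * len(centroids)
    for x in stream:
        # nearest centroid by squared Euclidean distance
        j = min(range(len(centroids)),
                key=lambda i: sum((c - a) ** 2 for c, a in zip(centroids[i], x)))
        counts[j] += 1
        eta = 1.0 / counts[j]  # running-mean step size
        centroids[j] = [c + eta * (a - c) for c, a in zip(centroids[j], x)]
    return centroids
```

With points arriving near (0, 0) and (5, 5) and two seed centroids, the centroids settle on the running means of the points assigned to them.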

      Hierarchical methods can be further subdivided into divisive and agglomerative. In divisive hierarchical clustering, a cluster is repeatedly split into smaller clusters until it cannot be split further. In contrast, agglomerative hierarchical clustering merges separate clusters until the distance between two clusters reaches a required threshold. Balanced iterative reducing and clustering using hierarchies (BIRCH), open distributed application construction (ODAC), E‐Stream, clustering using representatives (CURE), and HUE‐ are some hierarchical algorithms for data stream analysis.
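The agglomerative merge‐until‐threshold loop described above can be sketched as follows. This is a generic single‐linkage illustration under simplifying assumptions (small batch, Euclidean distance), not an implementation of BIRCH, CURE, or the other named algorithms.

```python
def agglomerative(points, threshold):
    """Merge the two closest clusters (single linkage) until the minimum
    inter-cluster distance exceeds the threshold."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # single linkage: distance between the closest pair of members
        return min(sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        if dist(clusters[i], clusters[j]) > threshold:
            break  # remaining clusters are farther apart than the threshold
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Two tight groups of points separated by a wide gap are merged internally but never across the gap.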

      In model‐based methods, a hypothesized model is run for each cluster to check which data properly fits a cluster. Some of the algorithms that fit into this category are CluDistream, Similarity Histogram‐based Incremental Clustering, sliding window with expectation maximization (SWEM), COBWEB, and Evolving Fractal‐Based Clustering of Data Streams.
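The "check which data fit a hypothesized model" idea can be illustrated with the simplest possible case: score a point under a Gaussian model per cluster and keep the best fit. This is a toy sketch of the model‐based principle, not any of the listed algorithms, and the 1‑D Gaussian assumption is mine.

```python
import math

def best_model(x, models):
    """Score a 1-D point under each hypothesized Gaussian cluster model
    (mean, std) and return the index of the best-fitting one."""
    def log_lik(x, mu, sigma):
        # log density of N(mu, sigma^2) at x
        return (-math.log(sigma * math.sqrt(2 * math.pi))
                - (x - mu) ** 2 / (2 * sigma ** 2))
    return max(range(len(models)), key=lambda i: log_lik(x, *models[i]))
```

A point near 0 fits the first model below; a point near 10 fits the second.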

      6.2 Semi‐Supervised Learning

      Semi‐supervised learning belongs to a class of AI frameworks that train on a combination of unlabeled and labeled data [89]. Semi‐supervised learning in a data stream context is challenging because data are generated in real time and labels may be missing due to factors such as communication errors, network delays, and expensive labeling processes [90]. According to Zhu and Li [91], a semi‐supervised learning problem in a data stream context is defined as follows. Let S = {(xt, yt) : t = 1, …, T0} denote the streaming data in the first T0 time period. Let Y = {1, 2, …, K} be the known label set. Each arriving instance xt carries a label yt ∈ {−1, 1, 2, …, K}. If yt = −1, xt is an unlabeled instance, but its true label lies in Y. As time goes on, evolution happens: the subsequent stream S′ = {(xt, yt) : t = T0 + 1, …, ∞} may contain novel classes. That is, there exists (xt′, yt′) ∈ S′ with yt′ = −1 whose true label is not in Y. Note that whenever yt′ ≠ −1, yt′ ∈ Y always holds.

      Semi‐supervised learning on streaming data may return results similar to those of the supervised approach. However, two observations hold for semi‐supervised learning on streaming data: (i) to stabilize the classifiers, considerably more objects ought to be labeled, and (ii) a larger confidence threshold adversely impacts the strength of the classifiers as the standard deviation increases [19]. Some of the semi‐supervised learning techniques for data streams include ensemble techniques, graph‐based methods, deep learning, active learning, and linear neighborhood propagation.
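The role of the confidence threshold noted above can be made concrete with a self‐training sketch: a simple learner labels the unlabeled instances it is confident about (margin above a threshold) and retrains on them. This is a minimal 1‑D nearest‐centroid illustration under my own simplifying assumptions, not one of the techniques listed.

```python
def self_train(labeled, unlabeled, threshold, rounds=5):
    """Self-training sketch: a nearest-centroid learner adopts its own
    confident predictions (distance margin above threshold) as labels."""
    def centroids(data):
        sums = {}
        for x, y in data:
            s, n = sums.get(y, (0.0, 0))
            sums[y] = (s + x, n + 1)
        return {y: s / n for y, (s, n) in sums.items()}

    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cs = centroids(labeled)
        newly, rest = [], []
        for x in pool:
            d = sorted((abs(x - c), y) for y, c in cs.items())
            # confident only if the runner-up class is much farther away
            if len(d) > 1 and d[1][0] - d[0][0] > threshold:
                newly.append((x, d[0][1]))
            else:
                rest.append(x)
        if not newly:
            break
        labeled += newly
        pool = rest
    return centroids(labeled)
```

A larger threshold leaves more instances unlabeled per round, which mirrors observation (ii): confidence is traded against the amount of self-labeled data.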

      6.3 Supervised Learning

      Supervised learning is the type of machine learning that infers a function from labeled training data. Each training example consists of a pair of an input vector and an output (supervisory signal). Let the data stream S = {…, dt − 1, dt, dt + 1, …}, where dt = (xi, yi), xi is the attribute-value vector of the ith instance, and yi is its class. Data stream classification aims to train a classifier f : x → y that establishes a mapping relationship between feature vectors and class labels [92].
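A classifier f : x → y trained incrementally over a stream can be sketched with the classic online perceptron. This is a generic illustration of stream classification, assuming binary labels in {−1, +1}; it is not a method singled out by the text.

```python
def train_perceptron(stream, dim, lr=1.0):
    """Online perceptron over a stream of (x, y) pairs with y in {-1, +1}:
    the weights are updated only when the current prediction is wrong."""
    w, b = [0.0] * dim, 0.0
    for x, y in stream:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
        if pred != y:
            # mistake-driven update: nudge the hyperplane toward x
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b

def predict(w, b, x):
    """The learned mapping f : x -> y."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

On a linearly separable stream the learned f reproduces the class labels after a few mistakes.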

      Supervised learning approaches can be subdivided into two major categories, which are regression and classification. When the class attribute is continuous, it is called regression, but when the class attribute is discrete, it is referred to as classification. Manual labeling is difficult, time‐consuming, and could be very costly [93]. In a streaming scenario with high velocity and volume,