can be done by starting from the root node. Depending on the result of the test at that node, the branch that leads to the corresponding child is followed. The process is repeated recursively until the child reached is a leaf. Test cases are applied to the tree in this way to determine the leaf, and hence the class, assigned to each case.
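The traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from the cited works: the `Leaf` and `Node` classes, the feature names and the toy tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class assigned to every sample that reaches this leaf


@dataclass
class Node:
    test: Callable[[dict], bool]      # test applied to the sample at this node
    if_true: Union["Node", Leaf]      # child followed when the test succeeds
    if_false: Union["Node", Leaf]     # child followed when the test fails


def classify(node, sample):
    """Start at the root and follow branches recursively until a leaf is reached."""
    if isinstance(node, Leaf):
        return node.label
    child = node.if_true if node.test(sample) else node.if_false
    return classify(child, sample)


# Hypothetical toy tree; the feature names are illustrative only.
toy_tree = Node(
    test=lambda s: s["requests_sms_permission"],
    if_true=Node(
        test=lambda s: s["network_calls"] > 10,
        if_true=Leaf("malicious"),
        if_false=Leaf("benign"),
    ),
    if_false=Leaf("benign"),
)
```

Calling `classify(toy_tree, sample)` with a dictionary of feature values returns the label stored in the leaf that the sample reaches.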
b) Genetic Algorithms (GA)
A genetic algorithm (GA) is a machine learning approach that solves a problem by imitating biological evolution. It optimizes a population of candidate solutions. The genetic operators, i.e., selection, crossover and mutation, act on data structures modelled on chromosomes (Fu et al., 2006). In the beginning, a population of chromosomes is generated at random. The chromosomes in the population represent possible solutions of the problem and are regarded as the candidate solutions. The individual positions of a chromosome are called “genes”, and they can be encoded as numbers, characters or bits. A fitness function is used to evaluate the goodness of each chromosome with respect to the desired solution. The crossover operator simulates natural reproduction, whereas the mutation operator simulates mutation of the species. The selection operator chooses the fittest chromosomes (Manek et al., 2016). The genetic algorithm and its operations are represented in Figure 2.2.
Figure 2.2 Flowchart of genetic algorithm.
The following three factors must be considered before using a genetic algorithm to solve a problem:
1 Fitness function
2 Individuals representation
3 Genetic algorithms parameters
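The interplay of the three factors and the three operators can be illustrated with a small sketch. It is only an illustration under stated assumptions: the fitness function (counting 1-bits), the bit-string representation, tournament selection, single-point crossover, elitism and all parameter values are choices made for this example and do not come from the cited works.

```python
import random


def fitness(chromosome):
    """Fitness function (assumed toy problem): number of 1-bits in the chromosome."""
    return sum(chromosome)


def select(population, rng, k=3):
    """Selection: return the fittest of k randomly drawn chromosomes (tournament)."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Crossover: join the head of one parent to the tail of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rng, rate=0.02):
    """Mutation: flip each gene (bit) independently with a small probability."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


def run_ga(n_genes=20, pop_size=30, generations=40, seed=0):
    """Return the best chromosome found and the best fitness of every generation."""
    rng = random.Random(seed)
    # Individuals representation: each chromosome is a list of bits.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = [max(fitness(c) for c in population)]
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: keep the current best unchanged
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
        history.append(max(fitness(c) for c in population))
    return max(population, key=fitness), history
```

Because the best chromosome of each generation is copied unchanged into the next one, the best fitness recorded in `history` never decreases. The population size, number of generations and mutation rate are the kind of GA parameters that have to be tuned for a real problem.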
A genetic algorithm-based method can be used to design an artificial immune system. Using this method, Wu et al. (2015) proposed an approach for smartphone malware detection in which static and dynamic signatures of malware were extracted to obtain malicious scores for the tested samples.
c) Random Forest
Random forest is a classification algorithm that uses a collection of tree-structured classifiers. The winning class is chosen on the basis of the votes cast by the individual trees of the forest. Each tree is constructed from a random sample of the training dataset. The selected dataset is therefore divided into a training dataset and a test dataset: the training data comprise the major portion of the dataset, whereas the test data comprise the minor portion. The steps required for tree construction are as follows:
1 A sample of N cases is selected at random from the original dataset; it forms the training set used for growing the tree.
2 Out of the M input variables, m variables are selected at random. The value of m is held constant while the forest is grown.
3 Each tree in the forest is grown to the maximum possible extent. No trimming or pruning of the trees is required.
4 All the classification trees are combined to form the random forest.
Random forest reduces the problem of overfitting on large datasets, and it trains and tests quickly on complex datasets. It is also referred to as an operational data mining technique. Every classification tree casts a vote for a class, and the class that receives the maximum number of votes is chosen as the solution class.
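The construction steps and the voting scheme above can be sketched in plain Python. The sketch rests on several assumptions that are not stated in the text: numeric features, Gini impurity as the split criterion, sampling of the N cases with replacement, and a fresh draw of m features at every node. All function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, rng):
    """Grow one tree to its full extent (no pruning), trying m random features per node."""
    features = rng.sample(range(len(rows[0])), m)
    split = best_split(rows, labels, features)
    if split is None:  # pure node or no improving split: make a leaf
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))


def predict_tree(tree, row):
    """Follow the splits until a leaf is reached (leaves are non-tuple class labels)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def grow_forest(rows, labels, n_trees=25, m=1, seed=0):
    """Grow n_trees trees, each on its own random sample of N cases."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # N cases drawn with replacement
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Every tree casts one vote; the class with the most votes wins."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A forest is grown with `grow_forest(rows, labels)` and queried with `predict_forest(forest, row)`. In practice a tested library implementation would normally be preferred; the sketch is only meant to make the four steps concrete.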
d) Association-rule mining
Association rule mining (ARM) is used to find interesting relationships among a set of attributes in datasets (Dwork et al., 2006). An association rule describes an inter-relationship among the items of a dataset. Such rules are very helpful in making strategic decisions about activities such as shelf management, promotional pricing, and many more (Jackson et al., 2007). Earlier, association rule mining involved a data analyst whose task was to discover patterns or association rules in the dataset given to him (Rathore, 2017). Sophisticated analysis of extremely large datasets can now be performed in a cost-effective manner (Tseng et al., 2016), but this may pose a data security risk (Beaver et al., 2009) to the data owner because the data miner can mine sensitive information (Bhargava et al., 2017). Nowadays, association rule mining is extensively used for pattern discovery in knowledge discovery in databases (KDD).
An ARM problem can be solved by traversing the items in a database with one of various algorithms, chosen on the basis of the user’s requirements (Patel et al., 2014). ARM algorithms can be broadly classified into DFS (Depth First Search) and BFS (Breadth First Search) on the basis of the approach used for traversing the search space (Stanley, 2013). Each of these two classes is further divided into intersecting and counting methods, on the basis of how the support values of the itemsets are determined. Apriori, Apriori-TID and Apriori-DIC are BFS algorithms with a counting strategy, whereas partition algorithms are BFS algorithms with an intersecting strategy. The Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) algorithm uses the intersecting strategy with DFS, and the FP-Growth algorithm uses the counting strategy with DFS (Yeung, Ding, 2003), (Bloedorn et al., 2003). These algorithms can be specifically optimized to improve speed (Barrantes et al., 2001), (Reddy et al., 2011).
Breadth First Search (BFS) with Counting Occurrences: A prominent algorithm in this group is the Apriori algorithm. It exploits the downward closure property of itemset support by pruning candidates that have infrequent subsets before their support is counted. Two important measures used when evaluating association rules are support and confidence. In BFS, this optimization is possible because the support values of all subsets of a candidate are known in advance. The main drawback of this approach is the increase in computational complexity when rules are extracted from a large database. The Fast Distributed Mining (FDM) algorithm is an improved, distributed and unsecured version of the Apriori algorithm (Lee et al., 1999). Advancements in data mining techniques enable organizations to use data more competently.
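The level-wise search and the pruning step can be illustrated with a compact sketch. It assumes that transactions are given as Python sets of item names and that support is measured as the fraction of transactions containing an itemset. It rescans the whole transaction list for every candidate and omits the hash tree and other optimizations, so it is a teaching sketch rather than an efficient implementation.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset, found level by level."""
    singletons = {frozenset([item]) for t in transactions for item in t}
    level = {c for c in singletons if support(c, transactions) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): drop a candidate if any of its
        # (k-1)-subsets is infrequent, before its own support is counted.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c, transactions) >= min_support}
    return frequent
```

For example, `apriori(transactions, 0.5)` returns every itemset contained in at least half of the transactions, and `confidence({"butter"}, {"bread"}, transactions)` gives the fraction of transactions containing butter that also contain bread.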
Apriori counts all candidates of cardinality k in a single scan of a large database. The costliest step of the Apriori algorithm is looking up the candidates in each transaction, for which a hash tree structure is used (Jacobsan et al., 2014). Apriori-TID, an extension of Apriori, represents each transaction by the current candidates it contains, whereas plain Apriori relies on the raw database. Apriori-Hybrid combines Apriori and Apriori-TID. Apriori-DIC uses a prefix tree to soften the strict separation between counting and candidate generation.
2.3 Clustering
Clustering is a data mining technique for grouping a set of objects in such a way that objects in the same cluster are more similar to one another than to objects in other clusters. In other words, intra-cluster similarity is maximized and inter-cluster similarity is minimized. Clustering is a form of unsupervised learning. The types of clustering algorithms are as follows:
a) Distribution Based
b) Density Based
c) Centroid Based
d) Connection Based or Hierarchical Clustering
e) Recent Clustering Techniques
a) Distribution-Based Clustering
In this clustering model, the data are grouped on the basis of probability, i.e., how likely the objects are to belong to the same distribution. The groups formed are thus typically based on the normal (Gaussian) distribution.
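One common way to realize this idea is to fit a mixture of Gaussians with the expectation-maximization (EM) procedure and to assign each point to the component under which it is most probable. The text does not name a specific algorithm, so the sketch below is an assumption: it handles only one-dimensional data and two components, and the function names and initialization are chosen for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal (Gaussian) distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances). The means start at the smallest and
    largest data values, which is an assumption of this sketch.
    """
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density well defined
    return weights, means, variances


def assign(x, weights, means, variances):
    """Return the index of the component under which x is most probable."""
    p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return p.index(max(p))
```

After `weights, means, variances = fit_gmm_1d(data)`, each point is clustered with `assign(x, weights, means, variances)`. Points that lie close together receive the same component index as long as the two groups are well separated.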
b) Density-Based Clustering
In