genes is always a demanding task. The diversity and complexity of the different types of cancer make the task even more challenging. With advances in biotechnology, large amounts of data are being generated using high-density oligonucleotide chips and cDNA arrays [8, 9]. Researchers can now measure the expression of thousands of genes simultaneously. However, there is a lack of suitable algorithms for extracting knowledge and mining information from this kind of biological data, which is highly significant. Hence, there is a persistent demand to explore and design suitable algorithms. One of the most significant applications of microarray data analysis is classifying tissue samples as normal or cancerous. In such applications, however, a large number of the identified genes turn out to be irrelevant. These genes have no impact on clinical application, and as a result the efficiency of the method is compromised [10, 11]. Moreover, working with and interpreting such a huge number of genes is not feasible. Selecting a small, accurate set of relevant genes from microarray data has therefore become a necessary and promising task. Selecting these genes is important for several areas of medical science, including drug discovery, targeted therapy, prognosis, and sometimes early detection [12, 13].
Gene expression data generated through high-throughput technology come in the form of a matrix in which each row holds the expression levels of one gene and each column corresponds to a sample. The genes are treated as features and number in the thousands, whereas the experimental samples are few, so the data are difficult to work with. This high dimensionality is a real problem from the outset. Many algorithms based on different Artificial Intelligence (AI) techniques have been tried over the years to address it. In particular, algorithms based on Machine Learning (ML), a branch of AI, have been used over the years as effective analytical tools for this type of data [14]. In an ML model, past data are used to predict future results. Learning methods based on statistical and probabilistic models and on optimization techniques can be applied to analyze the data. Methods such as Logistic Regression (LR), artificial neural networks (ANN), K-nearest neighbor (KNN), decision trees (DT), and Naïve Bayes are widely used in different contexts [15, 16]. Two categories of learning are mainly used in ML: supervised and unsupervised. A model that learns from known classes (labeled training data) is termed supervised. Unsupervised methods, on the other hand, learn from data whose classes are unknown, often termed unlabeled training data [17]. ML algorithms have been used for purposes such as classification of groups and the training and recognition of key features. The real power of ML algorithms is that they can recognize patterns in datasets that are large, noisy, and difficult to discern. This property is very useful for processing complex genomic data, specifically in cancer-related studies [18, 19].
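As a small illustration of the matrix layout described at the start of the paragraph above, the following sketch builds a genes-by-samples array and transposes it into the samples-by-features form that most ML code expects. The sizes, the random values, and the names `expression`, `labels`, and `X` are assumptions made for illustration only; they are not taken from any dataset used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: many genes (features), few samples.
n_genes, n_samples = 5000, 60

# Rows are genes, columns are samples, as in the layout described above.
expression = rng.normal(size=(n_genes, n_samples))

# Hypothetical class labels: 0 = normal tissue, 1 = cancerous tissue.
labels = np.array([0] * 30 + [1] * 30)

# Most ML code expects one row per sample and one column per feature.
X = expression.T

print(X.shape)              # (60, 5000)
print(n_genes / n_samples)  # number of features per available sample
```

With far more columns than rows in `X`, any model with one coefficient per gene has many more parameters than observations, which is the dimensionality problem the paragraph refers to.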
LR is a popular method for building a prediction model when the outcome is binary, and it has been extended to disease classification with microarray data. In this setting, a feature (gene) selection technique must be incorporated and the logistic model must be penalized. The fundamental reason is that the number of genes is very large compared to the number of samples. Selecting a proper model in this setting therefore needs new statistical methods. This matters because, if the feature selection step is ignored when assessing prediction error, the error estimate can be severely biased downward. Widely used generic methods such as cross-validation and the non-parametric bootstrap may not be very effective, owing to the high variability of the error estimate. The classification of diseases such as cancer from microarray data has been the subject of extensive research aimed at providing more precise diagnostic methods than the conventional pathological approach alone can provide. Gene expression can also be used to predict survival time, disease prognosis, and treatment response. The overall impact is significant because all of these factors have major clinical consequences. Designing a logistic prediction model from microarray data differs fundamentally from the standard logistic setting, however, because the number of genes observed is often in the thousands while the number of arrays (samples) is generally much smaller, often fewer than one hundred. A commonly used approach is to combine a gene selection step, called feature selection, which chooses a subset of genes for inclusion in the LR model, with penalized likelihood inference.
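The downward bias mentioned above can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the procedure used in this chapter: the data are pure noise, genes are ranked by the absolute difference of class means, and a nearest-centroid rule stands in for the penalized LR model to keep the code short. All names (`select_genes`, `cv_accuracy`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_genes, n_keep, n_folds = 40, 2000, 10, 5

# Pure-noise "expression" data: the labels carry no real signal,
# so an honest accuracy estimate should be close to chance level.
X = rng.normal(size=(n_samples, n_genes))
y = np.array([0, 1] * (n_samples // 2))


def select_genes(X, y, k):
    """Indices of the k genes with the largest absolute class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)


def cv_accuracy(X, y, select_inside_folds, fold_indices):
    """Cross-validated accuracy with gene selection inside or outside the folds."""
    all_idx = np.arange(len(y))
    correct = 0
    for test_idx in fold_indices:
        train_idx = np.setdiff1d(all_idx, test_idx)
        # Honest: pick genes from the training fold only.
        # Biased: pick genes using every sample, including the test fold.
        sel_idx = train_idx if select_inside_folds else all_idx
        genes = select_genes(X[sel_idx], y[sel_idx], n_keep)
        pred = nearest_centroid_predict(
            X[train_idx][:, genes], y[train_idx], X[test_idx][:, genes]
        )
        correct += int(np.sum(pred == y[test_idx]))
    return correct / len(y)


folds = np.array_split(rng.permutation(n_samples), n_folds)
biased_acc = cv_accuracy(X, y, False, folds)
honest_acc = cv_accuracy(X, y, True, folds)
print(f"selection outside folds: {biased_acc:.2f}")
print(f"selection inside folds:  {honest_acc:.2f}")
```

Because the labels are unrelated to the data, any accuracy well above chance in the first figure comes purely from letting the test samples influence which genes are selected. Repeating the selection inside each fold removes that leak, although the resulting estimate can still vary considerably from one split to another.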
LR is a tool that ML has borrowed from statistics. It is used for binary classification problems (problems in which the class takes one of two values). LR is widely used in the biological sciences where the dependent variable is categorical; in particular, it is widely used to build predictive models with a binary outcome and has been extended to disease classification using microarray data [20]. In the present article, we have developed an algorithm that uses an LR model to select features (genes) whose mutation is correlated with certain cancers. When designing a gene selection algorithm with an ML model, reducing the computational complexity is challenging because the dataset is very large: the total number of genes (features) is very high. In an LR model, too many features can cause overfitting, which compromises the performance of the algorithm [21]. Standard techniques widely used to reduce dimensionality include Principal Component Analysis (PCA), Kernel PCA, and Linear Discriminant Analysis (LDA). It has been observed that PCA performs better when the number of samples per class is small, while LDA works better for large datasets with multiple classes. When reducing dimensionality, class separability is an essential consideration. Since our aim is to develop a binary classifier, we address this with a hybrid approach in which the number of features is first reduced using PCA. Although there are many techniques for doing this, PCA loses the least information and, for this dataset, is appropriate for obtaining a better outcome. The output of PCA is then fed to the LR model for the prediction of genes. A threshold value is calculated and set for this binary classification and applied to test data to decide which genes are selected as candidate, or cancer-mediating, genes.
Finally, the resulting set of genes has been validated statistically and biologically.
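The hybrid idea described above (reduce the features with PCA, fit an LR model on the components, then apply a threshold to the predicted probability) can be sketched as follows. This is a minimal sketch on synthetic data, not the implementation used in this chapter: the data, the number of components, the gradient-descent settings, and the threshold of 0.5 are all assumptions, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 40 items, 500 features, binary labels.
n_items, n_features, n_components = 40, 500, 5
y = np.array([0, 1] * (n_items // 2))
X = rng.normal(size=(n_items, n_features))
X[y == 1, :50] += 3.0  # make the first 50 features informative

X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]


def pca_fit(X, k):
    """Return the feature means and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def pca_transform(X, mean, components):
    """Project X onto the principal directions."""
    return (X - mean) @ components.T


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(Z, y, learning_rate=0.05, n_iter=1000):
    """Fit LR weights by plain gradient descent on the log-loss."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(n_iter):
        error = sigmoid(Z @ w + b) - y
        w -= learning_rate * (Z.T @ error) / len(y)
        b -= learning_rate * error.mean()
    return w, b


# PCA is fitted on the training part only, then applied to the test part.
mean, components = pca_fit(X_train, n_components)
Z_train = pca_transform(X_train, mean, components)
Z_test = pca_transform(X_test, mean, components)

w, b = fit_logistic(Z_train, y_train)

threshold = 0.5  # assumed cut-off; the chapter computes its own value
p_test = sigmoid(Z_test @ w + b)
predicted = (p_test >= threshold).astype(int)
accuracy = (predicted == y_test).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Fitting PCA on the training portion only keeps the held-out items from influencing the projection. Raising or lowering `threshold` trades false positives against false negatives among the items labeled as candidates.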
3.2 Related Methods
Linear regression and LR are both statistical methods widely used in ML. Linear regression is effective for predicting values that are continuous in nature, whereas LR, although it is a regression model in form, is used mainly for classification. Regression models seek to predict values on the basis of independent variables. The key distinction between the two lies in the dependent variable: LR is useful when the dependent variable is binary, whereas linear regression is more effective when it is continuous.
Linear models are well defined in mathematics and are commonly used for predictive analysis. A linear model uses a straight line to describe the relationship between a predictor and a dependent (target) variable. There are two categories of linear regression: simple linear regression and multiple regression. In linear regression the independent variables may be discrete or continuous, but the dependent variable is continuous. Given an independent variable X and a dependent variable Y, the linear regression model fits the best straight line, determined by the least squares method, to capture the association between X and Y. The relationship between them is assumed to be linear. The key point is that simple linear regression has exactly one independent variable, whereas multiple regression has more than one.
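The straight-line fit described above can be written in a few lines using the closed-form least squares solution for one independent variable. The function name and the toy data are hypothetical and serve only as an illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With several independent variables the same criterion applies, but the solution is obtained from a system of linear equations rather than from a single ratio.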
Although LR is commonly used for classification, it can also be applied to regression. The response variable is binary and can belong to either of the two classes. The independent variables aid in predicting this categorical variable. When there are two classes and one must decide to which of them a new data point belongs, the algorithm computes a probability that ranges from 0 to 1. The LR model calculates the