for examination purpose. Data from both normal and carcinogenic states are given as input to the algorithm to generate the target output. The algorithm follows a hybrid approach in which PCA is used to reduce the dimensionality of the dataset. A logistic regression (LR) model is then applied as a binary classifier to detect the set of genes that may be related to cancer. The developed PC-LR model is applied to both lung and colon data.
3.4.1 Description of the Dataset
Two datasets, lung and colon, are used to test the algorithm. For the lung dataset, human gene expression is measured through microarray experiments, and data are obtained for tumor and normal samples. A total of 96 samples are collected, of which 86 are tumor samples and 10 are normal. More specifically, among the 86 lung adenocarcinoma samples, 67 belong to stage I and 19 to stage III. The ten normal lung samples are non-neoplastic. The colon data consist of 7,464 genes with 18 samples in the carcinogenic state and 18 in the normal state. More detailed information can be accessed from the site https://www.ncbi.nlm.nih.gov
3.4.2 Result Analysis
The algorithm is executed with r = 5, i.e., a group of five genes is selected at random at a time. For the lung dataset, each group therefore yields a matrix of 5 columns (genes/features) and 96 rows (samples), which is divided into training and test sets. For colon, the matrix has 5 columns and 36 rows and is divided in the same manner. The test set consists of 20% of the data, and the remaining 80% forms the training set. The data are standardized with a standard scaler so that each feature is brought onto unit scale. Then, PCA is applied to the selected matrix (96 × 5 for lung). For PCA, the number-of-components parameter is set to the variance fraction α = 0.95 on both the lung and colon datasets, so that enough components are retained to explain 95% of the variance.
After the dimensionality is reduced, LR is applied with the “sag” solver for faster convergence. The model is fitted on the training set, predictions are made for the test set, and accuracy is calculated by comparing these predictions with the true test labels. When the accuracy exceeded 85%, the genes in the group were selected as cancer mediating genes and stored in a new list.
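The procedure described above can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn and synthetic data, not the implementation used in this work; the function name `score_gene_group`, the stratified split, the fixed seed, `svd_solver="full"`, and `max_iter` are assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ACCURACY_THRESHOLD = 0.85  # selection cut-off stated in the text


def score_gene_group(X, y, gene_idx, seed=0):
    """Test accuracy of scaler -> PCA(95% variance) -> LR('sag') on one gene group."""
    X_sub = X[:, gene_idx]  # samples x r matrix for the chosen genes
    # 80/20 split; stratification is an assumption, not stated in the text.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = make_pipeline(
        StandardScaler(),
        # A float n_components keeps enough components for that variance fraction.
        PCA(n_components=0.95, svd_solver="full"),
        LogisticRegression(solver="sag", max_iter=5000, random_state=seed),
    )
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Synthetic stand-in shaped like the lung data: 96 samples (86 tumor, 10 normal).
rng = np.random.default_rng(0)
y = np.array([1] * 86 + [0] * 10)
X = rng.normal(size=(96, 50))  # 50 hypothetical genes
X[y == 1, :5] += 6.0  # make the first five genes informative (illustrative only)

informative_acc = score_gene_group(X, y, np.arange(5))

selected = set()
for _ in range(10):  # a few random groups of r = 5 genes
    group = rng.choice(X.shape[1], size=5, replace=False)
    if score_gene_group(X, y, group) > ACCURACY_THRESHOLD:
        selected.update(int(g) for g in group)
```

With 86 tumor and 10 normal samples, a model that always predicts the tumor class already scores about 0.9 on such a split, so a fixed accuracy threshold should be read with the class imbalance in mind.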
For the lung dataset, 886 genes were selected. When these genes were matched against the NCBI database, 102 were found to be true positives (TP). For the colon dataset, 207 genes were selected, of which 85 were found to be TP when matched against the NCBI database.
3.4.3 Result Set Validation
The genes in the result sets for the lung and colon datasets, which are taken to be correlated with cancer, have been validated biologically using the NCBI database. NCBI provides a gene database (http://www.ncbi.nlm.nih.gov/Database) from which the list of disease mediating genes for a specific disease can be obtained. The list is ordered by the relevance of the genes. We obtained separate gene lists for lung cancer and colon cancer. The algorithm selected 886 genes for lung and 207 for colon cancer as mutated genes. For the lung expression data, we compared this set with 1,067 genes from NCBI and identified 102 genes common to both sets. We call these TP genes (Figure 3.4). Thus, 784 (886 − 102) selected genes are not in the NCBI list; we denote these as false positives (FP). The remaining 965 (1,067 − 102) NCBI genes were not selected and are identified as false negatives (FN). Likewise, for the colon data, 1,223 genes are listed in the NCBI database, and our algorithm identified 207 genes. Of these, 85 matched the NCBI list and are marked as TP, 1,138 (1,223 − 85) genes are identified as FN, and 122 (207 − 85) genes are FP (Figure 3.3).
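The comparison against the reference list amounts to simple set operations. The sketch below illustrates this with made-up gene labels (`G1` to `G5` are placeholders, not real gene symbols) and then rechecks the subtraction reported above for the lung data; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(predicted, reference):
    """Return (TP, FP, FN) for a predicted gene set against a reference gene set."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # selected and in the reference list
    fp = len(predicted - reference)  # selected but not in the reference list
    fn = len(reference - predicted)  # in the reference list but not selected
    return tp, fp, fn


# Toy example with placeholder gene labels.
predicted = {"G1", "G2", "G3", "G4"}
reference = {"G3", "G4", "G5"}
tp, fp, fn = confusion_counts(predicted, reference)  # (2, 2, 1)

# Consistency check of the lung counts reported in the text.
lung_selected, lung_ncbi, lung_tp = 886, 1067, 102
lung_fp = lung_selected - lung_tp  # 784
lung_fn = lung_ncbi - lung_tp  # 965
```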
Evaluation requires particular care when an ML model is developed on a skewed dataset, and the task becomes even more significant when the dataset concerns cancer detection. For a skewed dataset, accuracy alone cannot decide whether the algorithm is working efficiently. Suppose that 99% of the cases in the dataset have no cancer. In a binary classification problem (predicting 1 for cancer and 0 for no cancer), we can simply predict 0 all the time and obtain 99% accuracy. If we deploy that model, we will have a 99% accurate ML model that never detects cancer. A person who has cancer will never be detected and will not get treatment. In our problem, we want to detect cancer mediating genes whose expression level changes significantly from the normal state to the cancerous state. So, here also, accuracy alone is not sufficient. Other evaluation metrics, based on precision and recall, can help with such datasets. The F-score combines the precision and recall of the model and is defined as their harmonic mean. It is commonly used for many kinds of ML models. Moreover, for a binary classification problem, it is informative to analyze accuracy against F-score when evaluating the efficiency of the model. Accuracy is defined simply as the number of correctly categorized examples divided by the total number of examples. Accuracy can be useful but does not take into account the subtleties of class imbalance or the differing costs of FN and FP. The F-score, on the other hand, is an effective measure when the costs of FP and FN differ or when there is a large class imbalance.
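The 99% example can be made concrete with a few lines of code. The sketch below uses a made-up label vector of 1,000 cases (10 with cancer, 990 without) and a classifier that always predicts 0; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted as positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


y_true = [1] * 10 + [0] * 990  # 1% cancer, 99% no cancer (illustrative data)
y_pred = [0] * 1000  # always predict "no cancer"

acc = accuracy(y_true, y_pred)  # 0.99
rec = recall(y_true, y_pred)  # 0.0, no cancer case is ever detected
```

The accuracy looks excellent while the recall shows that the classifier is useless for detection, which is the gap that precision-recall metrics are meant to expose.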
Our proposed method works with gene expression data in which the number of genes is very large but the number of genes whose mutation is correlated with cancer is very small. In this case, accuracy would be misleading: a classifier that labels every gene as unrelated to cancer would automatically get 90% accuracy but would be useless for the proposed work and would contribute little to real-world applications, especially in the field of medical science. As a result, the F-score, computed from precision and recall, has been given importance in evaluating the efficacy of the proposed model.
Figure 3.3 FN, TP, and FP values for colon.
Figure 3.4 FN, TP, and FP values for lung.
Precision is the fraction of TP examples among the examples that the model classified as positive. In other words, it is the number of TP divided by the number of TP plus FP. Recall, also known as sensitivity, is the fraction of actual positive examples that the model classified as positive. In other words, it is the number of TP divided by the number of TP plus FN. In our model, the resultant set of genes has been validated using the NCBI database for both colon and lung. The number of TP genes is identified from the intersection region of the diagram for the colon dataset (Figure 3.3) and for the lung dataset (Figure 3.4). The FP and FN values are identified from the figures in the same way.
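As a small illustration of these definitions, the sketch below computes precision, recall, and the F-score (harmonic mean of the two) from TP, FP, and FN counts, using the counts reported in Section 3.4.3 as inputs. The function name `precision_recall_f` is hypothetical, and returning 0.0 for an empty denominator is an assumption made for the example.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts taken from the validation against the NCBI lists.
lung = precision_recall_f(tp=102, fp=784, fn=965)
colon = precision_recall_f(tp=85, fp=122, fn=1138)

for name, (p, r, f) in {"lung": lung, "colon": colon}.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```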
Further, we have calculated the precision, recall, and F-score values to check how good our model is. Precision tells us how precise the model is, that is, how many of the genes predicted as positive are actually positive. The formulas used to calculate precision and recall are given in Equation (3.7) and Equation (3.8), respectively.
Recall