Группа авторов

Machine Learning Techniques and Analytics for Cloud Security


Скачать книгу

for examination purpose. Data of both normal and carcinogenic states are given as the input of the algorithm to generate the target output. The algorithm follows a hybrid approach where PCA has been incorporated for minimization of dimensionality of the dataset. Then, prepared logistic model is applied as a binary classifier to detect the collection of genes which might have possible relation with cancer. Our developed PC-LR model is applied on both lung and colon data.

      3.4.1 Description of the Dataset

      In our algorithm, two datasets, viz., lung and colon, are considered for testing and getting the output. With the help of microarray experiments, human gene expression is measured for lung and the data is obtained for tumor and normal sample. Total of 96 samples are collected of which 86 samples belong to tumor and 10 as normal state. In a more descriptive manner, it can be stated that among 86 samples of lung adenocarcinoma, 67 belong to stage I and 19 is of stage III. Ten lung samples are identified as neoplastic sample. The colon data consists of 7,464 genes with 18 samples that belong to carcinogenic state and 18 with normal state. More detailed information can be accessed from the site https://www.ncbi.nlm.nih.gov

      3.4.2 Result Analysis

      While executing the algorithm taking r = 5, i.e., a group of five genes is selected at random at a time. So, for lung dataset, it consisting of 5 cols (genes/features) and 96 rows (samples), which is divided into test and training dataset. For colon, it is 5 and 36, divided in same manner. Here, test data consist of 20% of the dataset and rest 80% belongs to training dataset. This dataset is scaled down by applying standard scalar and features of dataset is brought down onto unit scale. Then, PCA is applied on the selected 5 × 96 matrix. While applying PCA, the variance α is taken as 0.95 as number of components, parameter on both lung and colon datasets.

      For lung dataset, 886 genes were selected. When these genes were matched with the genes in the NCBI database, 102 were found to be true positive (TP). For colon dataset, 207 genes were selected out of which 85 were found to be TP when matched with NCBI database.

      3.4.3 Result Set Validation

      It is very important while developing an efficient algorithm using ML model with a skewed dataset. For example, if the dataset is about cancer detection, then the task becomes more significant. Accuracy alone cannot decide for a skewed dataset whether the algorithm is working efficiently or not. What happens is that if we see in the dataset that in 99% of the time, then there is no cancer. In a binary classification problem, we can easily predict 0 all the time (predicting 1 if cancer and 0 if no cancer) to get a 99% accuracy. If we implement that model, then we will have a 99% accurate model based on ML algorithm but we will never detect cancer. If someone has cancer, then s/he will never get detected and will not get treatment. In our problem, we want to detect cancer mediating genes whose expression level changes significantly from normal state to cancerous state. So, here also, only accuracy is not going to work. There are different evaluation matrices that can help with these types of datasets. Those evaluation metrics are called precision-recall evaluation metrics. The F-score is a way of combining the precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall. The F-score is commonly used effectively for many kinds of ML models. Moreover, for a binary classification problem, it is very much significant to analyze the accuracy vs. F-score to evaluate the efficiency of the model. Accuracy is defined as simply the number of correctly categorized examples divided by the total number of examples. Accuracy can be useful but does not take into account the subtleties of class imbalances, or differing costs of FN and FP. On the other hand, F-score is an effective measure when there are either differing costs of FP or FN or where there is a large class imbalance. As our proposed method works with gene expression data where number of genes is very large in number but the number of genes whose mutation is correlated to cancer will be very less, so in this case, the accuracy would be misleading, since a classifier that classifies set of genes not related to cancer would automatically get 90% accuracy but would be useless for the proposed work and hence will have little contribution in real-world application specially in the field of medical science. As a result, F-score has been given importance to evaluate the efficacy of the proposed model by proper application precision and recall.

Schematic illustration of FN, TP, and FP values for colon. Schematic illustration of FN, TP, and FP values for lung.