Group of authors

Machine Learning Techniques and Analytics for Cloud Security



genes is always a demanding task. The diversity and complexity of the different types of cancer make the task even more challenging. With advances in biotechnology, large volumes of data are being generated using high-density oligonucleotide chips and cDNA arrays [8, 9]. Researchers can now measure the expression of thousands of genes simultaneously. However, there is a lack of suitable algorithms for extracting knowledge from this kind of biological data, so the demand for exploring and designing such algorithms persists. One of the most significant applications of microarray data analysis is classifying tissue samples as normal or cancerous. In such applications, however, a large number of the measured genes turn out to be irrelevant. These genes have no impact on the clinical application, and including them compromises the efficiency of the method [10, 11]. Moreover, working with and interpreting a huge number of genes is often infeasible. Selecting a suitably small set of relevant genes from microarray data is therefore essential and has become a promising research direction. Identifying these genes matters for several areas of medical science, including drug discovery, targeted therapy, prognosis, and sometimes early detection [12, 13].

      LR is a popular method for building prediction models when the outcome is binary, and it has been extended to disease classification with microarray data. In this setting, a feature (gene) selection technique must be incorporated, typically by penalizing the logistic model, because the number of genes is very large compared with the number of samples. Model selection in this procedure therefore needs new statistical methods. This matters because ignoring the feature selection step when assessing prediction error can leave the error estimate severely biased downward. Widely used generic methods such as cross-validation and the non-parametric bootstrap may not be very effective, because the error estimation process is highly vulnerable in this setting. Classification of diseases such as cancer from microarray data has been the subject of extensive research aimed at providing more precise diagnoses than the conventional pathological approach alone can provide. Gene expression can also be used to predict survival time, disease prognosis, and treatment response. All of these have major clinical consequences. Designing a logistic prediction model for microarray data, however, differs fundamentally from fitting a standard logistic model: the number of genes observed often runs into the thousands, while the number of arrays (samples) is much smaller, often fewer than one hundred. A commonly used approach is to combine a gene selection step with penalized likelihood inference. This combination, known as feature selection, selects a subset of genes for inclusion in the LR model.
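
The combination of a penalty with logistic regression described above can be sketched as follows. This is a minimal illustration, not the method used in the cited studies: it assumes an L1 (lasso) penalty as the form of penalization, fits the model with plain proximal gradient descent in NumPy, and runs on synthetic data. The names `fit_l1_logistic`, `soft_threshold`, and `lam`, the sample sizes, and the choice of five informative genes are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_logistic(X, y, lam=0.05, n_iter=500):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient descent.

    Hypothetical helper for illustration; the intercept b is not penalized.
    """
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step size 1/L, where L bounds the curvature of the mean log-loss
    # (design matrix augmented with a column of ones for the intercept).
    A = np.hstack([X, np.ones((n, 1))])
    step = 4.0 * n / np.linalg.norm(A, 2) ** 2
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30.0, 30.0)  # guard against overflow in exp
        prob = 1.0 / (1.0 + np.exp(-z))      # predicted P(y = 1)
        resid = prob - y
        w = soft_threshold(w - step * (X.T @ resid) / n, step * lam)
        b -= step * resid.mean()
    return w, b


# Synthetic "microarray": 60 samples, 500 genes, only the first 5 informative.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 500
X = rng.standard_normal((n_samples, n_genes))
signal = X[:, :5] @ np.array([2.0, -2.0, 2.0, -2.0, 2.0])
y = (signal + 0.5 * rng.standard_normal(n_samples) > 0).astype(float)

w, b = fit_l1_logistic(X, y)
selected = np.flatnonzero(w)  # indices of genes kept by the penalty
print(f"{selected.size} of {n_genes} genes have non-zero coefficients")
```

Genes whose coefficient is exactly zero drop out of the model, so fitting and gene selection happen in one step. To avoid the downward bias mentioned above, the penalized fit (including the choice of `lam`) would have to be repeated inside every cross-validation fold rather than performed once on all samples.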

      3.2 Related Methods

      Linear regression and LR are both statistical methods widely used in ML. Linear regression is suited to regression, that is, to predicting continuous values, whereas LR, although a regression model in form, is used mainly for classification. Regression models seek to predict values on the basis of independent features. The key distinction between the two lies in the type of the dependent variable: LR is appropriate when the dependent variable is binary, and linear regression when it is continuous.
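
The distinction can be written compactly in the standard textbook forms of the two models. The symbols here are assumptions introduced for illustration and do not appear in the text: $x$ is the vector of independent variables, $\beta_0$ and $\beta$ are the intercept and coefficients, and $\varepsilon$ is a noise term.

```latex
% Linear regression: continuous response
y = \beta_0 + \beta^{\top} x + \varepsilon

% Logistic regression: probability of a binary response
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta^{\top} x)\bigr)}
```

Both models share the same linear predictor $\beta_0 + \beta^{\top} x$. Linear regression uses it directly as the predicted value, while LR passes it through the logistic function to obtain a probability between 0 and 1, which is then thresholded to assign a class.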

      Linear models are mathematically well defined and are now commonly used for predictive analysis. A linear model uses a straight line to describe the relationship between a predictor and a dependent (target) variable. There are two categories of linear regression: simple linear regression and multiple regression. In linear regression, the independent variables may be discrete or continuous, but the dependent variable is continuous. Given an independent variable X and a dependent variable Y, linear regression fits the best straight line, determined by the least squares method, to describe the association between X and Y. The relationship between them is assumed to be linear. The key point is that simple linear regression has exactly one independent variable, whereas multiple regression has two or more.
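
A least squares fit of this kind can be sketched in a few lines of NumPy. The data are made up for the example and lie exactly on a known line (or plane), so the recovered coefficients can be checked by hand. The function name `fit_simple_linear` and the variable names are hypothetical.

```python
import numpy as np


def fit_simple_linear(x, y):
    """Least squares slope and intercept for one independent variable."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Simple linear regression: one independent variable, data on y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_linear(x, y)

# Multiple regression: two independent variables, data on y = 1 + 2*x1 + 3*x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y_multi = 1.0 + 2.0 * x1 + 3.0 * x2
design = np.column_stack([np.ones_like(x1), x1, x2])  # column of ones = intercept
coef, *_ = np.linalg.lstsq(design, y_multi, rcond=None)

print(slope, intercept)  # slope and intercept of the fitted line
print(coef)              # [intercept, coefficient of x1, coefficient of x2]
```

The simple case uses the closed-form solution, while the multiple case hands the same least squares problem to `np.linalg.lstsq`, which handles any number of independent variables.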