Group of authors

Biomedical Data Mining for Information Retrieval


less amount of time. Narrowing the search space, or searching it in a targeted way, reduces the overall cost of the drug discovery process. The critical problem is how to establish a relationship between the 3D structure of the lead molecule and its biological activity. QSAR is a technique that predicts the activity of a set of compounds using equations derived from a set of known compounds [91]. In QSPR (quantitative structure–property relationships), by contrast, the physicochemical properties of known compounds serve as the response variable. Accurate prediction of the activity of chemical molecules is still a persistent issue in drug discovery. It is a general observation in structural bioinformatics that two proteins with similar structures often have similar functions. Nevertheless, this is not always true of chemical structures, where minute structural differences between a pair of compounds can change their activity against the same target receptor. This is the activity cliff problem, which is a hot topic of debate among computational and medicinal scientists [92, 93].
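
The idea of deriving an equation from known compounds and applying it to a new one can be illustrated with a minimal sketch. It assumes a single descriptor and an ordinary least-squares line; the descriptor (logP), the activity scale (pIC50), the function names and all numbers are hypothetical and are not taken from [91]. Real QSAR models typically use many descriptors and careful validation.

```python
from statistics import mean


def fit_qsar_line(descriptor, activity):
    """Fit activity ~ slope * descriptor + intercept by ordinary least squares."""
    x_bar, y_bar = mean(descriptor), mean(activity)
    sxx = sum((x - x_bar) ** 2 for x in descriptor)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(descriptor, activity))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_activity(model, descriptor_value):
    """Apply the derived equation to a compound with unknown activity."""
    slope, intercept = model
    return slope * descriptor_value + intercept


# Hypothetical training set of known compounds (illustrative values only).
train_logp = [1.0, 2.0, 3.0, 4.0]    # descriptor values
train_pic50 = [5.1, 5.9, 7.1, 7.9]   # measured activities

model = fit_qsar_line(train_logp, train_pic50)
predicted = predict_activity(model, 2.5)  # activity estimate for a new compound
```

An activity cliff shows up in such a model as a pair of compounds with nearly identical descriptor values but very different measured activities, which a smooth fitted equation cannot reproduce.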

      Ligand-based drug design (LBDD) is another approach to drug design, applicable only when 3D structural information on the receptor is unavailable. LBDD relies mainly on pre-existing knowledge of compounds that are known to bind to the receptor. The physicochemical properties of known ligands are used to predict their activity and to develop SAR models for screening unknown compounds [97]. Although artificial intelligence can be applied in both SBDD and LBDD approaches to automate the drug discovery process, its implementation in LBDD approaches is more common these days. Some recent methods, such as proteochemometric modeling (PCM), extract individual descriptor information from both the ligands and the receptors, together with their combined interaction information [98]. The machine learning classifiers use the individual descriptors, as well as the cross-descriptor information, to predict bioactivity relations.
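
The combination of individual and cross-descriptor information can be sketched as follows. This is only one common way to build cross-terms (pairwise products of ligand and receptor descriptors), assumed here for illustration; the function name and the descriptor values are hypothetical, and [98] may define the interaction terms differently.

```python
def pcm_features(ligand_desc, target_desc):
    """Build one feature vector for a ligand-receptor pair.

    The vector holds the ligand descriptors, the receptor descriptors,
    and every pairwise product of the two (the cross-descriptor block).
    """
    cross = [l * t for l in ligand_desc for t in target_desc]
    return list(ligand_desc) + list(target_desc) + cross


# Hypothetical descriptors (illustrative values only).
ligand = [0.5, 2.0]        # e.g. two ligand properties
target = [1.0, 3.0, 4.0]   # e.g. three receptor properties

features = pcm_features(ligand, target)
```

A classifier trained on such vectors sees one row per ligand-receptor pair, so it can in principle learn which ligand properties matter for which receptor properties.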

      Biological activity is a broad term that relates to the ability of a compound/target to achieve the desired effect [99]. Bioactivity, or biological activity, may be divided into the activity of the receptor (its functionality) and the activity of compounds. In pharmacology, the term is replaced by pharmacological activity, which usually denotes the beneficial or adverse effects of drugs on biological systems. A compound must possess both activity against the target and permissible physicochemical properties to qualify as an ideal drug candidate. The absorption, distribution, metabolism, excretion and toxicity (ADMET) profile of a compound is required to predict the bioavailability, biodegradability and toxicity of drugs. Initially, simple descriptor-based statistical models were created for predicting the bioactivity of drug compounds. Later, the target specificity and selectivity of compounds increased manyfold with the inclusion of machine learning-based models [100]. Machine learning classifiers may be built and trained on pre-existing knowledge of either molecular descriptors or substructure mining in order to classify new compounds.

      Besides expert systems, there are other automated prediction methods such as Bayesian methods, neural networks and support vector machines. The Bayesian Inference Network (BIN) is one of the crucial methods, as it allows a straightforward representation of the uncertainties involved in different medical domains, including diagnosis, treatment selection, prediction of prognosis and screening of compounds [105]. Nowadays, doctors use BIN models in prognosis and diagnosis. The use of BIN models in ligand-based virtual screening demonstrates their successful application in the field of drug discovery. A comparative study evaluated the efficiency of three models, Tanimoto Coefficient Networks (TAN), conventional BINs and BIN Reweighting Factor (BINRF), for screening billions of drug compounds based on structural similarity information [106]. All three models use the MDL Drug Data Report (MDDR) database for training as well as testing. Ligand-based virtual screening with the BINRF model not only significantly improved the search strategy but also identified active molecules with lower structural similarity than the TAN and BIN-based approaches. Thus, this is an era of integrative approaches that aim for higher accuracy in drug or drug target prediction.
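
The structural similarity measure behind Tanimoto-based screening can be sketched in a few lines. The sketch below assumes that each compound is represented by the set of "on" bit positions of a binary fingerprint; the compound names and bit sets are hypothetical. It shows only plain Tanimoto ranking and does not implement the BIN or BINRF models compared in [106].

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def rank_by_similarity(query_fp, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical query and screening library (illustrative bit sets only).
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 9, 12},  # 4 shared bits, 5 in the union
    "cmpd_B": {1, 4, 20},        # 2 shared bits, 5 in the union
    "cmpd_C": {30, 31},          # no shared bits
}

ranking = rank_by_similarity(query, library)
```

A ranking of this kind favours close structural analogues of the query, which is why a method that also retrieves active molecules with lower structural similarity is of interest.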

      The Support Vector Machine (SVM) is a supervised machine learning technique most often used in knowledge-based drug design [108]. Selecting an appropriate kernel function and optimum parameters is the most challenging part of problem modelling, as both are problem-dependent. Later on, a more specific kernel function was designed that controls the complexity of subtrees through parameter adjustments. The SVM model integrated with the newly designed kernel function successfully classified and cross-validated small molecules having anti-cancer properties [109]. Graph kernel-based learning algorithms are widely used in SVMs, and they can directly utilise graph information to classify compounds. Graph kernel-based SVMs are employed to classify diverse compounds, to predict their biological activity and to rank them in screening assays. Artificial neural networks (ANNs), learning algorithms that mimic the human neural system, also have applications in the drug discovery process. The comparative robustness of the SVM and ANN algorithms was checked in terms of their ability to distinguish between drug and non-drug compounds [110]. The result favours SVM, as it classified the compounds with higher accuracy and robustness than ANN.
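
How a kernel can work directly on graph information may be easier to see in a small sketch. The example below assumes a deliberately simple edge-histogram kernel: each molecule is a heavy-atom graph, and the kernel value is the dot product of the two molecules' bond-type counts. It is not the subtree kernel of [109]; the function names and the two toy molecules are chosen only for illustration.

```python
from collections import Counter


def bond_type_counts(atoms, bonds):
    """Count bonds by the unordered pair of atom labels they connect."""
    return Counter(tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds)


def edge_histogram_kernel(mol_a, mol_b):
    """Dot product of the bond-type count vectors of two molecular graphs."""
    counts_a = bond_type_counts(*mol_a)
    counts_b = bond_type_counts(*mol_b)
    # A Counter returns 0 for bond types that are absent from mol_b.
    return sum(n * counts_b[key] for key, n in counts_a.items())


# Toy heavy-atom graphs: (atom labels, list of bonded index pairs).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])          # C-C-O
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])   # C-O-C

molecules = [ethanol, dimethyl_ether]
gram = [[edge_histogram_kernel(a, b) for b in molecules] for a in molecules]
```

The resulting Gram matrix is the kind of input that an SVM implementation accepting a precomputed kernel could be trained on. Richer graph kernels compare walks, paths or subtrees rather than single bonds.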

      Other machine learning algorithms: decision tree, random forest, logistic regression, recursive partitioning