less time. A reduced or targeted search space lowers the overall cost of the drug discovery process. The critical problem is how to establish a relationship between the 3D structure of a lead molecule and its biological activity. QSAR is a technique that can predict the activity of a set of compounds using equations derived from a set of known compounds [91], whereas QSPR (quantitative structure–property relationships) predicts a physicochemical property of a compound as the response variable. Accurate prediction of the activity of chemical molecules remains a persistent issue in drug discovery. It is a general observation in structural bioinformatics that if two protein structures share structural similarity, their functions may also be similar. Nevertheless, this is not always true for chemical structures, where minute structural differences between a pair of compounds can change their activity against the same target receptor. This is the activity cliff problem, which remains a hot topic of debate among computational and medicinal scientists [92, 93].
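The QSAR idea described above can be illustrated with a minimal sketch: fit a regression model that maps molecular descriptors of known compounds to their measured activity, then use the fitted equation to predict the activity of a new compound. The descriptor values and activities below are hypothetical toy numbers, not data from any real assay.

```python
# Minimal QSAR sketch (hypothetical data): learn a linear relationship between
# molecular descriptors and activity, then predict activity for a new compound.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows = known compounds; columns = toy descriptors [logP, MW/100, H-bond donors]
X_train = np.array([
    [1.2, 1.8, 2],
    [2.5, 2.4, 1],
    [0.8, 1.5, 3],
    [3.1, 3.0, 0],
])
y_train = np.array([5.1, 6.3, 4.7, 6.9])  # hypothetical pIC50 values

# Fit the QSAR equation on the known compounds
model = LinearRegression().fit(X_train, y_train)

# Predict activity for an unseen compound from its descriptors alone
X_new = np.array([[2.0, 2.2, 1]])
predicted_activity = model.predict(X_new)[0]
```

In practice the descriptors would be computed from the compound structures and the model validated on held-out compounds; the linear form is only the simplest choice of QSAR equation.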
The lock-and-key and induced-fit hypotheses deal with the biochemistry of ligand binding at the receptor. In general, a ligand–receptor complex comprises a smaller ligand that attaches to a functional cavity of the receptor. The 3D structural information of both the ligand and the receptor is essential in order to understand their functional roles. Binding of a ligand at the active site changes the 3D conformation of the receptor protein and thus its functional state. X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy and electron microscopy are the currently available experimental techniques for determining the 3D structures of proteins. Since there is a considerable gap between the number of available protein sequences and solved 3D structures, one can harness bioinformatics techniques, namely molecular modeling, to predict 3D structures in less time with comparable accuracy. Molecular docking is a technique that can predict the binding mode of a ligand at the receptor when 3D information for both is available; it is most commonly used for pose prediction of a ligand at the active site of the receptor. The approach of identifying lead compounds using the 3D structural information of the receptor protein is known as Structure-Based Drug Design (SBDD). Nowadays, the process of identifying, predicting and optimising the activity of small molecules against a biological target falls under the SBDD domain [94–96].
Ligand-based drug design (LBDD) is another approach to drug design, applied when 3D structural information of the receptor is unavailable. LBDD mainly relies on pre-existing knowledge of compounds that are known to bind the receptor. The physicochemical properties of known ligands are used to predict their activity and to develop SARs for screening unknown compounds [97]. Although artificial intelligence can be applied in both SBDD and LBDD approaches to automate the drug discovery process, its implementation in LBDD approaches is more common these days. Some recent methods, such as proteochemometric modeling (PCM), extract individual descriptor information from both ligands and receptors, along with their combined interaction information [98]. Machine learning classifiers then use the individual descriptors, as well as the cross-descriptor information, to predict bioactivity relations.
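One simple way to realise the PCM idea of combining individual and cross-descriptor information is to concatenate the ligand descriptors, the protein descriptors, and their pairwise products into a single feature vector per (ligand, target) pair. The sketch below uses randomly generated hypothetical descriptors and labels purely to show the feature construction; real PCM studies derive descriptors from actual structures and sequences.

```python
# PCM sketch (hypothetical data): build per-pair features from ligand
# descriptors, protein descriptors, and their cross-terms, then train a
# classifier on bioactivity labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pcm_features(ligand_desc, protein_desc):
    """Concatenate ligand, protein, and cross (outer-product) descriptors."""
    cross = np.outer(ligand_desc, protein_desc).ravel()
    return np.concatenate([ligand_desc, protein_desc, cross])

rng = np.random.default_rng(0)
# 40 hypothetical (ligand, target) pairs: 3 ligand + 2 protein descriptors each
ligands = rng.normal(size=(40, 3))
proteins = rng.normal(size=(40, 2))
X = np.array([pcm_features(l, p) for l, p in zip(ligands, proteins)])
# Toy activity label that depends on a ligand-protein interaction term
y = (ligands[:, 0] * proteins[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

The cross-terms are what distinguish PCM from plain ligand-only QSAR: they let the model learn interactions that depend jointly on the compound and the target.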
Biological activity is a broad term that relates to the ability of a compound/target to achieve a desired effect [99]. Bioactivity may be divided into the activity of the receptor (functionality) and the activity of compounds. In pharmacology, the term pharmacological activity is used instead, usually representing the beneficial or adverse effects of drugs on biological systems. A compound must possess both activity against the target and permissible physicochemical properties in order to be established as an ideal drug candidate. The absorption, distribution, metabolism, excretion and toxicity (ADMET) profile of a compound is required to predict the bioavailability, biodegradability and toxicity of drugs. Initially, simple descriptor-based statistical models were created for predicting the bioactivity of drug compounds. Later on, the target specificity and selectivity of compounds were increased many-fold by the inclusion of machine learning-based models [100]. Machine learning classifiers may be built and trained on pre-existing knowledge of either molecular descriptors or substructure mining in order to classify new compounds.
One can train the classifiers, and classify new compounds, considering either a single parameter or a combination of parameters: activity (active/non-active), drug-likeness, pharmacodynamics, pharmacokinetics or toxicity profiles of known compounds [91]. Nowadays, many open-source as well as commercial applications are available for predicting the skin sensitisation, hepatotoxicity or carcinogenicity of compounds [101]. Apart from this, several expert systems are in use for finding the toxicity of unknown compounds using knowledge-base information [102, 103]. These are artificial intelligence-enabled systems that use human knowledge (or intelligence) to reason about problems or to make predictions. They can make qualitative judgements based on qualitative, quantitative, statistical and other evidence provided to them as input. For instance, DEREK and StAR use knowledge-based information to derive new rules that better describe the relationship between chemical structures and their toxicity [102]. DEREK uses a data-driven approach to predict the toxicity of novel compounds and compares the predictions with biological assay results to refine its prediction rules. Toxtree is an open-source platform for detecting the toxicity potential of chemicals. It uses a Decision Tree (DT) classification model to estimate toxicity, with toxicological data derived from the structural information of chemicals used as input to the model [104].
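The decision-tree style of toxicity estimation can be sketched as follows. Note this is a hedged illustration, not Toxtree's actual rule base: the binary "structural alert" features and the toxicity labels below are hypothetical, and Toxtree itself applies expert-curated decision rules rather than a tree learned from data.

```python
# Decision-tree toxicity sketch (hypothetical alerts and labels): binary flags
# derived from a compound's structure are used to classify it as toxic (1) or
# non-toxic (0).
from sklearn.tree import DecisionTreeClassifier

# Each row: [has_nitro_group, has_epoxide, has_aromatic_amine] (toy alerts)
X_train = [
    [1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1],
    [0, 0, 0], [0, 0, 0], [1, 0, 1], [0, 0, 0],
]
y_train = [1, 1, 1, 1, 0, 0, 1, 0]  # hypothetical toxicity calls

# Learn a decision tree over the structural-alert features
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify an unseen compound from its alert flags
prediction = tree.predict([[0, 1, 1]])[0]
```

The appeal of trees in this setting is interpretability: each path from root to leaf reads as an explicit structure-to-toxicity rule, much like the if-then rules of an expert system.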
Besides expert systems, there are other automated prediction methods such as Bayesian methods, Neural Networks and Support Vector Machines. Bayesian Inference Networks (BINs) are among the key methods that allow a straightforward representation of the uncertainties involved in medical domains such as diagnosis, treatment selection, prognosis prediction and the screening of compounds [105]. Nowadays, doctors use BIN models in prognosis and diagnosis, and the use of BIN models in ligand-based virtual screening attests to their successful application in drug discovery. A comparative study examined the efficiency of three models for screening drug compounds based on structural similarity information: Tanimoto Coefficient networks (TAN), conventional BINs and the BIN Reweighting Factor (BINRF) model [106]. All three models used the MDL Drug Data Report (MDDR) database for both training and testing. Ligand-based virtual screening with the BINRF model not only significantly improved the search strategy but also identified active molecules with lower structural similarity than the TAN- and BIN-based approaches. Thus, this is an era of integrative approaches to achieve higher accuracy in drug or drug-target prediction.
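The structural-similarity measure underlying TAN-style screening is the Tanimoto coefficient on binary molecular fingerprints, T(A, B) = |A ∩ B| / |A ∪ B|: the number of substructure bits two compounds share, divided by the number of bits set in either. The 8-bit fingerprints below are toy examples; real fingerprints (e.g. substructure keys) run to hundreds or thousands of bits.

```python
# Tanimoto similarity of two equal-length binary molecular fingerprints.
def tanimoto(fp_a, fp_b):
    """T(A, B) = shared on-bits / total on-bits across both fingerprints."""
    on_a = sum(fp_a)
    on_b = sum(fp_b)
    common = sum(a & b for a, b in zip(fp_a, fp_b))
    return common / (on_a + on_b - common)

# Two toy 8-bit fingerprints (hypothetical substructure keys)
fp1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp2 = [1, 1, 1, 0, 0, 0, 1, 0]
similarity = tanimoto(fp1, fp2)  # 3 shared bits / 5 total bits = 0.6
```

In virtual screening, a query compound's fingerprint is compared against a library with this measure and the top-ranked hits are carried forward; the BINRF result above is notable precisely because it retrieved actives that plain Tanimoto ranking would score as dissimilar.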
Bayesian ANalysis to determine Drug Interaction Targets (BANDIT) uses a Bayesian approach to integrate varied data types in an unbiased manner, and provides a platform that allows the integration of newly available data types [107]. BANDIT has the potential to expedite the drug development process, as it spans the entire drug search space, from new target identification and validation to clinical candidate development and drug repurposing.
The Support Vector Machine (SVM) is a supervised machine learning technique most often used in knowledge-based drug design [108]. The selection of an appropriate kernel function and optimum parameters is the most challenging part of problem modelling, as both are problem-dependent. A more specific kernel function was later designed that can control the complexity of subtrees through parameter adjustment; an SVM model integrated with this kernel function successfully classified and cross-validated small molecules with anti-cancer properties [109]. Graph-kernel-based learning algorithms are widely used with SVMs, as they can directly utilise graph information to classify compounds. Graph-kernel SVMs have been employed to classify diverse compounds, to predict their biological activity and to rank them in screening assays. Artificial neural networks (ANNs), learning algorithms that mimic the human neural system, also have applications in the drug discovery process. The comparative robustness of the SVM and ANN algorithms was assessed in terms of their ability to discriminate drug from non-drug compounds [110]. The results favour the SVM, which classified the compounds with higher accuracy and robustness than the ANN.
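A drug/non-drug classification of the kind compared in [110] can be sketched with an SVM as below. The two classes of compounds are simulated as clusters in a toy 4-descriptor space; the kernel choice and the C and gamma parameters are exactly the problem-dependent settings the paragraph above highlights, and would be tuned by cross-validation on real data.

```python
# SVM sketch (hypothetical data): separate drug-like from non-drug-like
# compounds in a toy descriptor space using an RBF kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# 30 "drug-like" and 30 "non-drug-like" compounds, 4 toy descriptors each
drugs = rng.normal(loc=1.0, size=(30, 4))
non_drugs = rng.normal(loc=-1.0, size=(30, 4))
X = np.vstack([drugs, non_drugs])
y = np.array([1] * 30 + [0] * 30)

# Kernel function and C/gamma are the problem-dependent choices; RBF with
# default 'scale' gamma is a common starting point.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
```

Swapping `kernel="rbf"` for `kernel="precomputed"` and supplying a compound-by-compound similarity matrix is how graph- or fingerprint-based kernels of the kind mentioned above are plugged into the same classifier.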
Other machine learning algorithms applied in this domain include decision trees, random forests, logistic regression and recursive partitioning.