
Biomedical Data Mining for Information Retrieval

[60] SCOP2 http://scop2.mrc-lmb.cam.ac.uk/ [61] CATH http://www.cathdb.info/ [62]

Predictive models used for protein structure prediction include hidden Markov models, neural networks, support vector machines, Bayesian methods, and clustering methods.

Hidden Markov Model for Prediction

HMMs are among the most important techniques for protein fold recognition. In the HMM version of profile–profile methods, the HMM for the query is aligned with the prebuilt HMMs of the template library, and this profile–profile alignment is computed using standard dynamic programming. Earlier HMM approaches, such as SAM [63] and HMMER [64], built an HMM for a query from its homologous sequences and then used this HMM to score sequences with known structures in the PDB using the Viterbi algorithm, an instance of dynamic programming. This can be viewed as a form of profile–sequence alignment. More recently, profile–profile methods have been shown to significantly improve the sensitivity of fold recognition over profile–sequence and sequence–sequence methods [65].
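The Viterbi scoring step mentioned above can be illustrated with a minimal sketch. The two-state model below (H for a helix-like state, C for a coil-like state, emitting "h" for hydrophobic and "p" for polar residues) and all of its probabilities are invented for illustration; they are assumptions, not parameters from SAM, HMMER, or any profile HMM.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # V[t][s]: best log-probability of any state path ending in s at step t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            V[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path


# Hypothetical toy model (all numbers are assumptions for illustration).
STATES = ("H", "C")
START = {"H": 0.5, "C": 0.5}
TRANS = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
EMIT = {"H": {"h": 0.7, "p": 0.3}, "C": {"h": 0.2, "p": 0.8}}

log_prob, path = viterbi("hhhppp", STATES, START, TRANS, EMIT)
print("".join(path), math.exp(log_prob))
```

With these toy parameters the most probable state path for "hhhppp" is HHHCCC. Working in log space avoids numerical underflow on longer sequences, which matters for real protein-length inputs.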

Neural Networks (NNs)

Determining the structure of a protein from its sequence alone is very challenging, which in turn makes function determination more difficult. Because a functional protein involves many molecular interactions and several levels of folding, the raw sequence alone is not sufficient input to produce the desired output. Deep learning is a rapidly evolving field for modelling complex relationships between input features and desired outputs, and it has been put to great use in structure prediction. Various deep neural network architectures, loosely inspired by the neural networks of the human brain, have been proposed, including deep feed-forward neural networks, recurrent neural networks, neural Turing machines, and memory networks. Such advances are making the field more competitive and accurate. A comparison can be drawn with the human brain, which receives a great deal of information as input yet is able to analyze it and reach a logical conclusion.

Pattern recognition and classification are important applications of NNs. Early NN methods that are still widely used today include PHD [66, 67], PSIPRED [68], and JPred [69]. The field has since advanced considerably: deep neural network (DNN) models have been shown to have a performance advantage in image- and language-based problems [70], and this advantage has been seen to extend to specific CASP areas such as residue–residue contact prediction and direct generation of accurate tertiary structures [71–75].
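To make the idea of a feed-forward classifier over sequence input concrete, here is a minimal sketch of sliding-window one-hot encoding followed by a single forward pass that outputs probabilities for three states (helix, strand, coil). It shows the general idea only and does not reproduce PHD, PSIPRED, or JPred. The window size, hidden layer size, example sequence, and random untrained weights are all assumptions, so the output probabilities carry no biological meaning.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # placeholder for window positions beyond the sequence ends
ALPHABET = AMINO_ACIDS + PAD


def encode_windows(sequence, window=5):
    """One-hot encode a sliding window centred on every residue."""
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    features = []
    for i in range(len(sequence)):
        vec = []
        for ch in padded[i:i + window]:
            one_hot = [0.0] * len(ALPHABET)
            one_hot[ALPHABET.index(ch)] = 1.0  # raises ValueError on unknown letters
            vec.extend(one_hot)
        features.append(vec)
    return features


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(n_in, n_hidden, n_out, seed=0):
    """Random, untrained weights (illustration only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": mat(n_out, n_hidden), "b2": [0.0] * n_out}


def forward(x, params):
    """Hidden ReLU layer followed by a softmax over the output classes."""
    hidden = [max(0.0, h) for h in dense(x, params["W1"], params["b1"])]
    return softmax(dense(hidden, params["W2"], params["b2"]))


sequence = "MKTAYIAKQR"  # hypothetical example sequence
X = encode_windows(sequence, window=5)
params = init_params(len(X[0]), n_hidden=8, n_out=3)
probs = [forward(x, params) for x in X]  # one (helix, strand, coil) triple per residue
```

In practice such a network would be trained on sequences with known structures, and the one-hot input is often replaced by richer features such as evolutionary profiles.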

Support Vector Machines (SVMs)

The Support Vector Machine (SVM) is a supervised machine learning technique that has been used to rank protein models [76] and has been applied to pattern classification problems in biology. The SVM method is applied to a database derived from SCOP, in which protein domains are classified based on:

1. Known protein structures in the data bank

2. Evolutionary relationships of the predicted protein

3. The principles of bond formation governing the 3-D structure of the protein.

The advantages of SVMs include effective avoidance of over-fitting, which is a weakness of several other methods, the ability to manage large feature spaces, and the condensation of large amounts of information.
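As a rough illustration of how an SVM separates two classes, the following sketch trains a linear SVM by sub-gradient descent on the regularised hinge loss. This is one simple training scheme chosen for brevity, not the procedure used in [76]. The two-dimensional feature vectors and labels are invented toy values, not features derived from SCOP, and the hyperparameters `lam`, `lr`, and `epochs` are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin: move the boundary away from it.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: only apply regularisation shrinkage.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: two features per example, labels +1 / -1.
X_train = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.0),
           (-2.0, -2.0), (-3.0, -1.0), (-1.0, -2.5)]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print(predict(w, b, (4.0, 4.0)), predict(w, b, (-4.0, -4.0)))
```

The regularisation term `lam` is what penalises overly complex boundaries, which is the mechanism behind the resistance to over-fitting noted above. Real applications typically use kernel functions and an established library rather than a hand-written optimiser.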

Bayesian Methods

The most successful methods for determining secondary structure from primary structure use machine learning approaches that are quite accurate, but they do not directly incorporate structural information. Higher-order protein structure needs to be determined because it gives a better and deeper understanding of a protein’s function in the cell, as structure and function are strongly related. Various computational methods have been developed to predict secondary structure when the primary amino acid sequence is available, and one such method is the Bayesian method.

The knob-socket model of protein packing in secondary structure forms the basis of the Bayesian model. Protein packing can bring residues close together in space even though they are distant in the primary sequence [77, 78], which several other methods do not take into account. The Bayesian model considers the influence of residue packing on secondary structure determination. The method therefore has the advantage of having constructs for the direct inclusion and prediction of the coil and turn secondary states, whereas other secondary structure prediction methods are indirect: they predict alpha helix and beta sheet and do not predict coil directly. Secondary folding also depends strongly on the surrounding environment (aqueous or non-aqueous), as extensive hydrogen bonding and hydrophobic interactions are involved. The method thus helps in understanding the environment responsible for secondary structure formation.
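The underlying Bayesian update can be sketched generically: a prior over secondary states is combined with the likelihood of an observed feature to give a posterior. This is plain Bayes' rule, not the knob-socket-based model itself. The state names, the "packed"/"exposed" feature, and every probability below are invented for illustration.

```python
def posterior(prior, likelihood, observation):
    """Bayes' rule: P(state | obs) is proportional to P(obs | state) * P(state)."""
    unnormalised = {s: prior[s] * likelihood[s][observation] for s in prior}
    total = sum(unnormalised.values())
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical prior over secondary states (assumed values).
PRIOR = {"helix": 0.35, "sheet": 0.25, "coil": 0.40}

# Hypothetical likelihood of a residue being tightly packed or exposed
# in each state (assumed values).
LIKELIHOOD = {
    "helix": {"packed": 0.6, "exposed": 0.4},
    "sheet": {"packed": 0.7, "exposed": 0.3},
    "coil": {"packed": 0.2, "exposed": 0.8},
}

post = posterior(PRIOR, LIKELIHOOD, "packed")
best_state = max(post, key=post.get)
print(best_state, post)
```

Because coil has its own prior and likelihood here, it receives a direct posterior probability rather than being assigned by elimination, which mirrors the advantage described above.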

Clustering Methods

A protein rarely performs its function in isolation; various kinds of interactions are needed [79], as discussed earlier in this chapter in the context of quaternary structure. Protein–protein interactions are thus fundamental to almost all biological processes [80], and it is important to understand them. The increasing availability of large-scale protein–protein interaction data has made it possible to understand the basic components and organization of the cell machinery at the network level. Protein–protein interactions can be studied with advanced high-throughput technologies such as yeast two-hybrid, mass spectrometry, and protein chip technologies, which have made huge data sets of such interactions available [81] that can be put to great use in structure prediction.

In computational analysis, such protein–protein interaction data can be naturally represented as networks. This network representation provides an initial global picture of protein interactions on a genomic scale. In the clustering method, the protein interaction network is represented as an interaction graph in which proteins are vertices (or nodes) and interactions are edges. This representation has been used to study topological properties of protein interaction networks, including the network diameter, the distribution of vertex degree, and the clustering coefficient; these studies show that the network is scale-free [82–85] and exhibits the small-world effect [86, 87]. Clustering protein interaction networks has been shown to be an effective approach in systems biology for understanding the relationship between the organization of a network and its function [88].
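The graph representation and the three topological measures named above can be sketched as follows. The protein names P1 to P5 and their interactions are hypothetical, and the diameter function assumes the graph is connected.

```python
from collections import Counter, deque
from itertools import combinations


def build_graph(edges):
    """Undirected interaction graph as a dict: protein -> set of partners."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that interact with each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def diameter(adj):
    """Longest shortest path, via breadth-first search from every node.

    Assumes the graph is connected.
    """
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


# Hypothetical interactions among five proteins.
EDGES = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]

graph = build_graph(EDGES)
degree_counts = Counter(len(partners) for partners in graph.values())
coefficients = {p: clustering_coefficient(graph, p) for p in graph}
mean_coefficient = sum(coefficients.values()) / len(coefficients)
net_diameter = diameter(graph)
```

In this toy graph P1, P2, and P3 form a fully connected triangle, so P1 and P2 have a clustering coefficient of 1.0, while P5 has only one partner and scores 0.0. On real interaction data, a heavy-tailed `degree_counts` is what indicates scale-free behaviour.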

The proteins are grouped into sets (clusters) such that proteins in the same cluster are more similar to one another than to proteins in different clusters. Clusters are of two types: protein complexes and functional modules. Protein complexes are groups of proteins that interact with each other at the same time and place, forming a single multimolecular structure, as seen in the RNA splicing and polyadenylation machinery and in protein export and transport complexes, to name a few [89]. Functional modules, in contrast, consist of proteins that bind each other at different times and places while participating in the same cellular process. Examples of functional modules include the yeast pheromone response pathway and MAP signalling cascades [90], which begin with an extracellular signal that triggers a signalling cascade resulting in gene activation and other processes.

2.7 Role of Artificial Intelligence in Computer-Aided Drug Design

High-throughput screening (HTS) is a set of techniques capable of identifying biologically active molecules with desired properties from compound databases containing billions of compounds. Predicting and identifying active compounds with high accuracy is crucial to reducing the time taken to discover potent drugs. Medicinal chemistry companies use screening techniques to identify active compounds from drug databases in a significantly
