Qiwen Dong

Biological Language Model



remote homologous sequences, while its MSA from the UniRef90 database usually contains more homologous sequences. The results in Table 3-4 indicate that the evolutionary information of homologous sequences is more powerful for distinguishing different protein secondary structures than that of remote homologous sequences. Among the structure-based encodings, the Micheletti potentials perform much better with the BRNN method than with the Random Forests method. Among the machine-learning encodings, ProtVec and ProtVec-3mer achieve significantly better performance than the values given in Table 3-4, which demonstrates the potential of machine-learning encoding. It is worth noting that ProtVec-3mer outperforms ProtVec with the BRNN algorithm, which is consistent with the authors’ recent work.49 Overall, for the deep learning algorithm BRNN, the position-dependent PSSM encoding still performs best among all encoding methods. Among the position-independent encoding methods, the Micheletti potentials achieve the best performance, which demonstrates that structure-related information has application potential in protein structure and function studies.

      Figure 3-4 The architecture of the long short-term memory (LSTM) bidirectional recurrent neural networks for protein secondary structure prediction.


      In addition to protein sequence labeling tasks, protein sequence classification tasks such as protein remote homology detection60 and protein fold recognition61,62 have also received a lot of attention. Here, we perform another assessment of the 16 selected amino acid encoding methods based on the protein fold recognition task. Many machine learning methods have been developed to classify protein sequences into different fold categories for protein fold recognition.60 Deep learning methods can automatically extract discriminative patterns from variable-length protein sequences and have achieved significant success.61 Following Hou’s work,61 we used a one-dimensional deep convolutional neural network (DCNN) to assess the usefulness of the 16 selected encoding methods for protein fold recognition. As shown in Fig. 3-5, the network has 10 hidden convolution layers, each with 10 filters and two window sizes (6 and 10), a max pooling layer that keeps the 20 maximum values, and a flatten layer that is fully connected to the output layer, which outputs the probability of each fold type.
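
      One reason a convolutional network of this kind can accept variable-length sequences is that the max pooling layer keeps a fixed number of values per filter, whatever the input length. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. The function names (`conv1d_valid`, `k_max_pool`, `fixed_length_features`), the toy averaging filters and the use of one scalar per sequence position (rather than a full encoding vector) are assumptions made for illustration; only the window sizes (6 and 10) and the 20 retained values come from the text.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding; apply ReLU to each response."""
    width = len(kernel)
    out = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        response = sum(w * x for w, x in zip(kernel, window))
        out.append(max(0.0, response))
    return out


def k_max_pool(values, k=20):
    """Keep the k largest values, zero-padding when fewer than k exist."""
    top = sorted(values, reverse=True)[:k]
    return top + [0.0] * (k - len(top))


def fixed_length_features(signal, kernels, k=20):
    """Convolve with every kernel, pool each feature map, and flatten."""
    feats = []
    for kernel in kernels:
        feats.extend(k_max_pool(conv1d_valid(signal, kernel), k))
    return feats


# Hypothetical toy filters: one per window size mentioned in the text.
toy_kernels = [[1.0 / 6] * 6, [1.0 / 10] * 10]

# Two toy "sequences" of very different lengths (assumed scalar encodings).
short_seq = [float(i % 7) for i in range(50)]
long_seq = [float(i % 11) for i in range(400)]

short_feats = fixed_length_features(short_seq, toy_kernels)
long_feats = fixed_length_features(long_seq, toy_kernels)
```

      Both feature lists have the same length (2 filters × 20 values), so a fully connected output layer of fixed size can follow regardless of how long the input protein is.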

      Figure 3-5 The architecture of the one-dimensional deep convolution neural network for protein fold recognition.

       3.5.1 Benchmark datasets for protein fold recognition

      The most commonly used data sources for evaluating protein fold recognition methods are the SCOP database63 and its extended version, the SCOPe database.64 SCOP is a manually curated structural classification of proteins whose three-dimensional structures have been determined. All of the proteins in SCOP are classified at four hierarchical levels: class, fold, superfamily and family. Folds represent the main characteristics of protein structures, and the protein fold could reveal the evolutionary process between the protein sequence and its corresponding tertiary structure.65 Here we use the F184 dataset, which was constructed by Xia et al.66 based on the SCOPe database. The F184 dataset contains 6451 sequences from 184 folds, with less than 25% sequence identity. Each fold contains at least 10 sequences, which ensures that there are enough sequences for both training and testing. We randomly selected 20% of the sequences from each fold as test data, leaving the remaining 80% as training data. This gave 5230 sequences for training and 1221 sequences for testing.
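
      A per-fold random split of this kind can be sketched as follows. This is an illustrative sketch only: the function name `split_by_fold`, the fixed random seed, the toy fold labels and the rounding rule for the number of test sequences per fold are assumptions, so it is not expected to reproduce the exact 5230/1221 split reported above.

```python
import random
from collections import defaultdict


def split_by_fold(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction of each fold."""
    rng = random.Random(seed)

    # Group sequence indices by their fold label.
    by_fold = defaultdict(list)
    for idx, fold in enumerate(labels):
        by_fold[fold].append(idx)

    train_idx, test_idx = [], []
    for fold in sorted(by_fold):
        members = by_fold[fold][:]
        rng.shuffle(members)
        # Assumed rounding rule; at least one test sequence per fold.
        n_test = max(1, round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: three hypothetical folds with 10, 15 and 25 sequences.
toy_labels = ["a.1"] * 10 + ["b.1"] * 15 + ["c.1"] * 25
train_idx, test_idx = split_by_fold(toy_labels)
```

      Splitting within each fold, rather than over the whole dataset at once, keeps every fold represented in both the training and the test data.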

       3.5.2 Performances of different encodings on protein fold recognition task

      The comparison results of the 16 selected encoding methods for protein fold recognition are listed in Table 3-5. It should be noted that the training process for each encoding method is repeated 10 times to reduce stochastic effects. In contrast to the results for protein secondary structure prediction, the performances of most position-independent encoding methods are similar. All of the binary, physicochemical and machine-learning-based encoding methods (except ProtVec) achieve mean accuracies of about 30%, demonstrating that position-independent encodings offer only limited information for protein fold classification. The two structure-based encodings have better accuracies, near 33%, demonstrating that the structure potential is more closely related to the protein fold type. The two evolution-based methods PAM250 and BLOSUM62 perform best among the 12 position-independent encoding methods, which means that evolutionary information is more tightly coupled with the protein structure. The position-dependent encoding methods PSSM and HMM achieve better performances still, especially PSSM. This again indicates that protein evolutionary information is tightly coupled with the protein structure, and that homologous information is more useful than remote homologous information. The machine-learning-based AESNN3 and ANN4D encodings achieve performances comparable to those of other position-independent encoding methods but have much lower dimensions (3 for AESNN3 and 4 for ANN4D), showing their potential for further application. The performance of the ProtVec encoding is poor, which could be caused by the overlapping strategy, as has also been mentioned by the author.9 The ProtVec-3mer encoding performs better, demonstrating the effectiveness of combining ProtVec with 3-mers.

image

      Notes: Top 1: the accuracy calculated in the case that the first predicted fold type is the actual fold type. Top 5: the accuracy calculated in the case that the top 5 predicted fold types contain the actual fold type. Top 10: the accuracy calculated in the case that the top 10 predicted fold types contain the actual fold type. Mean: the mean value of the accuracies on Top 1, Top 5, and Top 10.
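
      The Top 1, Top 5, Top 10 and Mean measures defined in these notes can be computed as in the following minimal sketch. The function names (`top_k_accuracy`, `summarize`, `scores_from_ranking`) and the toy 12-class scores are assumptions for illustration and are unrelated to the values in Table 3-5.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


def summarize(scores, true_labels, ks=(1, 5, 10)):
    """Return the accuracy for each k plus their mean."""
    accs = {k: top_k_accuracy(scores, true_labels, k) for k in ks}
    accs["mean"] = sum(accs[k] for k in ks) / len(ks)
    return accs


def scores_from_ranking(ranking):
    """Build a toy score row in which ranking[0] gets the highest score."""
    row = [0.0] * len(ranking)
    for position, cls in enumerate(ranking):
        row[cls] = float(len(ranking) - position)
    return row


# Four toy samples over 12 hypothetical fold classes (0..11).
toy_scores = [
    scores_from_ranking([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),  # truth 0: rank 1
    scores_from_ranking([1, 2, 3, 0, 4, 5, 6, 7, 8, 9, 10, 11]),  # truth 3: rank 3
    scores_from_ranking([0, 1, 2, 3, 4, 5, 6, 8, 7, 9, 10, 11]),  # truth 7: rank 9
    scores_from_ranking([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),  # truth 11: rank 12
]
toy_truth = [0, 3, 7, 11]

accs = summarize(toy_scores, toy_truth)
```

      In this toy case the first sample counts toward all three measures, the second toward Top 5 and Top 10, the third toward Top 10 only, and the fourth toward none.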

      It should be noted that the benchmark presented here is based on the DCNN method, and these encodings may perform differently with other machine learning methods. The DCNN method was selected here mainly because it can handle variable-length sequences and has achieved significant success on fold recognition tasks.

      Amino acid encoding is the first step of protein structure and function prediction, and it is one of the foundations for success in those studies. In this chapter, we proposed a systematic classification of amino acid encoding methods and reviewed the methods in each category. According to their information sources and information extraction methodologies, these methods are grouped into five categories: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding and machine-learning encoding. To benchmark and compare different amino acid encoding methods, we first selected 16 representative methods from those five categories. Then, based on two representative protein-related studies, protein secondary structure prediction and protein fold recognition, we constructed three machine learning models with reference to state-of-the-art studies. Finally, we encoded the protein sequences and carried out the same training and test phases on the benchmark datasets for each encoding method. The performance of each encoding method is regarded as an indicator of its potential in protein structure and function studies.

      The assessment results show that the evolution-based position-dependent encoding method PSSM consistently