Qiwen Dong

Biological Language Model



the evolution-based encoding, structure-based encoding and machine-learning encoding methods extract information based on the physicochemical properties of amino acids by using different strategies. Specifically, different amino acids may have different mutation tendencies in the evolutionary process because of their hydrophobicity, polarity, volume and other properties. These mutation tendencies are reflected in sequence alignments and are detected by the evolution-based encoding methods. Similarly, the physicochemical properties of amino acids can affect the inter-residue contact potentials in tertiary protein structures, which form the basis of the structure-based encoding methods. The machine-learning encoding methods likewise learn amino acid encodings from physicochemical representations or evolution information (such as homologous protein structure alignments), which can be seen as another variant of physicochemical properties. Although these encoding methods share a similar theoretical basis, their performance differs because of restrictions in their implementation. The one-hot encoding method introduces no artificial correlation between amino acids, but it is highly sparse and redundant, which leads to a complex machine learning model. Because the physicochemical properties of amino acids play fundamental roles in the protein folding process, the physicochemical property encoding methods should in theory be effective. However, as the physicochemical properties relevant to protein folding and their numerical metrics are unknown, developing an effective physicochemical property encoding method remains an unresolved problem. The evolution-based encoding methods extract evolution information using only protein sequences, and can therefore benefit from large-scale protein sequence data.
In particular, PSSM has shown strong performance in many studies.44 However, for proteins without homologous sequences, the performance of evolution-based methods is limited. The structure-based encoding methods encode amino acids based on inter-residue contact potentials, which provide a low-dimensional representation of protein structure. Because the number of known protein structures is limited, their performance is limited as well. Early machine-learning encoding methods faced the same problem of insufficient data samples, but several recently developed methods have overcome it by taking advantage of unlabeled sequence data.9,38,39
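The sparsity of one-hot encoding noted above can be made concrete with a minimal sketch. The alphabet ordering, the function names and the choice to leave non-standard residues as all-zero rows are assumptions made for illustration, not part of any method described in this chapter.

```python
import numpy as np

# The 20 standard amino acids in a fixed one-letter order (an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an (L, 20) matrix with a single 1 per row for each standard residue.

    Residues outside the 20-letter alphabet are left as all-zero rows,
    which is a simplification chosen for this sketch.
    """
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    return 1.0 - np.count_nonzero(matrix) / matrix.size


# Hypothetical 8-residue sequence used only as an example input.
example = one_hot_encode("MKTAYIAK")
print(example.shape)      # (8, 20)
print(sparsity(example))  # 19 of every 20 entries are zero
```

Every pair of distinct amino acids has a zero dot product under this encoding, which is what "no artificial correlation" means in practice; the cost is that 95% of the input carries no signal.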

      As discussed, different amino acid encoding methods have specific advantages and limitations; so what is the most effective encoding method? According to Wang et al.,12 the best encoding method should significantly reduce the uncertainty of the output of the prediction model, or the encoding should capture both the global similarity and the local similarity of protein sequences; here, the global similarity refers to the overall similarity among multiple sequences, while the local similarity refers to motifs in the sequences. Riis and Krogh35 proposed that redundant encodings cause the prediction model to overfit and thus need to be simplified. Meiler et al.37 also tried to use reduced representations of amino acids’ physicochemical and statistical properties for protein secondary structure prediction. Zamani and Kremer4 stated that an effective encoding must store information associated with the problem at hand while diminishing superfluous data. In summary, an effective amino acid encoding method should be information-rich and non-redundant. “Information-rich” means that the encoding contains enough information that is highly relevant to protein structure and function, such as physicochemical properties, evolution information and contact potential. “Non-redundant” means that the encoding is compact and does not contain noise or other unrelated information. For example, in neural network-based protein structure and function prediction, a redundant encoding leads to complicated networks with a very large number of weights, which in turn leads to overfitting and restricts the generalization ability of the model. Therefore, provided that it contains sufficient information, a more compact encoding will be more useful and will generalize better.
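The link between encoding dimension and the number of network weights can be illustrated with simple arithmetic. The sketch below assumes a fully connected first layer fed with a fixed window of residues; the window size of 15, the 100 hidden units and the 5-dimensional compact encoding are hypothetical values chosen for illustration only.

```python
def first_layer_weights(encoding_dim, window_size, hidden_units):
    """Count the input-to-hidden weights (biases excluded) of a fully connected
    layer fed with `window_size` residues, each encoded as `encoding_dim` values."""
    return encoding_dim * window_size * hidden_units


# Hypothetical settings: a 15-residue window and 100 hidden units.
one_hot = first_layer_weights(encoding_dim=20, window_size=15, hidden_units=100)
compact = first_layer_weights(encoding_dim=5, window_size=15, hidden_units=100)

print(one_hot)  # 30000
print(compact)  # 7500
```

Under these assumed settings, reducing the per-residue encoding from 20 to 5 dimensions cuts the first-layer weights by a factor of four, which is the sense in which a compact encoding limits model complexity.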

      Over the past two decades, several studies have investigated effective amino acid encoding methods.5 David45 examined the effectiveness of various hydrophobicity scales by using a parallel cascade identification algorithm to assess the structural or functional classification of protein sequences. Zhong et al.46 compared orthogonal encoding, hydrophobicity encoding, BLOSUM62 encoding and PSSM encoding by using the Denoeux belief neural network for protein secondary structure prediction. Hu et al.6 combined orthogonal encoding, hydrophobicity encoding and BLOSUM62 encoding to find the optimal encoding scheme by using an SVM with a sliding window training scheme for protein secondary structure prediction. Their test results show that the combination of the orthogonal and BLOSUM62 matrices achieved the highest accuracy among all the encoding schemes tested. Zamani and Kremer4 investigated the efficiency of 15 amino acid encoding schemes, including orthogonal encoding, physicochemical encoding, and secondary structure- and BLOSUM62-related encodings, by training artificial neural networks to approximate the substitution matrices. Their experimental results indicate that the number (dimension) and the types (properties) of amino acid encodings are the two key factors that determine encoding performance. Dongardive and Abraham47 compared the orthogonal, hydrophobicity, BLOSUM62, PAM250 and hybrid encoding schemes of amino acids for protein secondary structure prediction and found that the best performance was achieved using the BLOSUM62 matrix. These studies explored amino acid encoding methods from different perspectives, but each evaluated only a subset of the encoding methods, and only on small datasets.
To present a comprehensive and systematic comparison, in this chapter we performed a large-scale comparative assessment of various amino acid encoding methods based on two tasks, protein secondary structure prediction and protein fold recognition, which are presented in the following sections. It should be noted that our aim is to assess how much effective information is contained in different encoding methods, rather than to explore the optimal combination of encoding methods.
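To illustrate how per-residue encodings are typically combined and fed to a window-based predictor, such as the sliding window scheme mentioned above, the following is a minimal sketch. It is not the implementation used in any of the cited studies. It concatenates a one-hot vector with a single hydrophobicity value (the Kyte-Doolittle hydropathy scale, chosen here only as an example) and zero-pads the sequence ends; the function names, the window size and the example sequence are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values, used as one example of a hydrophobicity scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_residues(sequence):
    """Concatenate a 20-dim one-hot vector with a 1-dim hydrophobicity value -> (L, 21).

    Only the 20 standard residues are handled in this sketch.
    """
    features = np.zeros((len(sequence), 21), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        features[pos, AA_INDEX[aa]] = 1.0
        features[pos, 20] = KYTE_DOOLITTLE[aa]
    return features


def sliding_windows(features, window_size):
    """Build one flattened window per residue; positions beyond either end are zero-padded."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that each window has a centre residue")
    half = window_size // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="constant")
    return np.stack([padded[i:i + window_size].ravel() for i in range(len(features))])


# Hypothetical 8-residue sequence and a window of 5 residues.
windows = sliding_windows(encode_residues("MKTAYIAK"), window_size=5)
print(windows.shape)  # (8, 105)
```

Each row of `windows` is the input vector for predicting the label of one residue; swapping `encode_residues` for another encoding changes only the per-residue dimension, which makes this layout convenient for comparing encodings under an otherwise fixed model.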

      In computational biology, protein sequence labeling tasks, such as protein secondary structure prediction, solvent accessibility prediction, disorder region prediction and torsion angle prediction, have gained a great deal of attention from researchers. Among those sequence labeling tasks, protein secondary structure prediction is the most representative task,48 and several previous amino acid encoding studies have also paid attention to this topic.6,35,46,47 Therefore, we first assess the various amino acid encoding methods based on the protein secondary structure prediction task.

       3.4.1 Encoding methods selection and generation

      To perform a comprehensive assessment of different amino acid encoding methods, we select 16 representative encoding methods covering each category for evaluation. A brief introduction of the 16 selected encoding methods is given in Table 3-2. Except for the PSSM and HMM encodings, most of these encodings are position-independent and can be used directly to encode amino acids. It should be noted that some protein sequences may contain unknown amino acid types; these amino acids are represented by the average value of the corresponding column if the original encoding does not handle this situation. For ProtVec,9 which is a 3-gram encoding, we encode each amino acid by adding its left and right adjacent amino acids to form the corresponding 3-gram word. Since the first and last amino acids do not have enough adjacent amino acids to form 3-grams, they are represented by the “<unk>” encoding in ProtVec. Recently, further work on ProtVec (ProtVecX49) has demonstrated that concatenating ProtVec with k-mers can achieve better performance; here, we also evaluate the performance of ProtVec concatenated with 3-mers (named ProtVec-3mer). For the position-dependent encoding methods PSSM and HMM, we follow the common practice for generating them. Specifically, for the PSSM encoding of each protein sequence, we ran the PSI-BLAST26 tool with an e-value threshold of 0.001 and three iterations against the UniRef90 sequence database,50 which is filtered at 90% sequence identity. The HMM encoding is extracted from the HMM profile obtained by running HHblits27 against the UniProt20 protein database50 with the parameters “-n 3 -diff inf -cov 60”. According to the HHsuite user guide, we use the first 20 columns of the HMM profile and convert the integers in the HMM profile to amino acid emission frequencies by using the formula $h_{\mathrm{fre}} = 2^{-0.001 \times h}$, where $h$ is the initial integer in the HMM