Qiwen Dong

Biological Language Model



secondary structure prediction and protein fold recognition tasks, suggesting its important role in protein structure and function prediction. However, another evolution-based, position-dependent encoding method, HMM, does not perform well; the main reason could be that remote homologous sequences provide only limited evolutionary information for the target residue. The one-hot encoding is highly sparse and leads to complex machine learning models, while its two compressed representations, one-hot (6-bit) encoding and binary 5-bit encoding, lose valuable information to varying degrees and cannot be widely used in related research. More reasonable strategies for reducing the dimension of one-hot encoding need to be developed. For the physicochemical property encodings, the choice of properties and the extraction methodology are the two important factors in constructing a valuable encoding. Structure-based encodings and machine-learning encodings achieve comparable or even better performance than other widely used encodings, suggesting that more attention should be paid to these two categories.
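To make the sparsity of one-hot encoding concrete, the following is a minimal Python sketch, not the implementation used in the assessment above. The residue ordering, the helper name `one_hot_encode`, and the toy sequence are assumptions chosen only for illustration.

```python
# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode each residue as a 20-dimensional vector containing a single 1."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


# Hypothetical toy sequence, used only to illustrate the encoding.
example = one_hot_encode("MKTAYIAK")
nonzero = sum(sum(row) for row in example)
total = len(example) * len(AMINO_ACIDS)
sparsity = 1 - nonzero / total  # fraction of entries that are zero
```

Every residue contributes exactly one non-zero entry out of 20, so 95% of the matrix is zero regardless of sequence length. This is the redundancy that the compressed 6-bit and 5-bit variants try to remove, at the cost of some information.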

      At a time when the gains from larger datasets and better algorithms have largely been realized, exploring more effective encoding schemes for amino acids should be a key factor in further improving the performance of protein structure and function prediction. In the following, we provide some perspectives for future related studies. First, updated position-independent encodings should be constructed from new protein datasets. With the exception of one-hot encoding, all position-independent encoding methods construct their encodings from information extracted from native protein sequences or structures. Random errors are unavoidable in those encodings, and larger datasets will help to reduce them. As sequencing and structure determination techniques continue to advance, the number of known protein sequences and structures has grown rapidly in recent years. Considering that most of the position-independent encoding methods were proposed a decade ago, it would be valuable to reconstruct them using new datasets. Second, structure-based or function-based encoding methods require more attention. Structure-based encoding methods have been shown to perform well in protein secondary structure prediction and protein fold recognition. These encodings reflect the structural potential of amino acids, which should be highly correlated with protein structure and function. With the growing number of proteins with known structure, the prospects for structure-based encodings are considerable. Furthermore, encodings that reflect functional potential may be more useful than others for protein function prediction; thus, exploring function-based encoding methods is a worthwhile topic. Third, machine-learning encoding methods are a promising topic for future studies. As amino acid encoding is an open problem, most encoding methods rest on an artificially defined basis; for example, the physicochemical property encodings are constructed from protein fold-related properties observed by researchers, which inevitably introduces some unknown biases. Machine-learning methods, however, can avoid those artificial biases by learning the amino acid encoding automatically from biological data. Protein sequences and natural languages are similar to a certain extent; for instance, protein sequences can be compared to sentences, and amino acids or polypeptide chains can be compared to words. Considering that distributed word representations have broadly improved performance in natural language processing tasks, protein sequence analysis should also benefit from distributed representations of amino acids or amino acid n-grams. Some recent studies have demonstrated the potential of distributed amino acid representations in protein family classification, disordered protein identification and protein functional property prediction, but most of these methods use distributed representations of amino acid n-grams, which cannot be directly used to predict residue-level properties. Thus, residue-level distributed representations of amino acids are a topic that needs more attention.
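The sentence-and-word analogy, and the reason n-gram representations do not map directly onto single residues, can be illustrated with a short Python sketch. It assumes overlapping 3-grams as the "words"; the function names and the toy sequence are hypothetical and are not taken from any of the cited methods.

```python
def overlapping_ngrams(sequence, n=3):
    """Split a protein 'sentence' into overlapping n-gram 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngrams_covering(position, sequence, n=3):
    """Return every n-gram that contains the residue at a 0-based position."""
    start = max(0, position - n + 1)
    end = min(position, len(sequence) - n)
    return [sequence[i:i + n] for i in range(start, end + 1)]


# Hypothetical toy sequence, used only for illustration.
sequence = "MKTAYIAK"
words = overlapping_ngrams(sequence)      # tokens an n-gram embedding is trained on
covering = ngrams_covering(3, sequence)   # n-grams that include the residue 'A' at index 3
```

An interior residue belongs to up to n different n-grams, while a terminal residue belongs to only one. An embedding learned per n-gram therefore has no unique vector for an individual residue, which is why residue-level prediction calls for residue-level representations.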
