Qiwen Dong

Biological Language Model



function prediction and the n-gram biological language model from natural language processing has been used to filter the missing proteins. Finally, conclusions and future perspectives are presented.


       Chapter 2

       Linguistic Feature Analysis of Protein Sequences

       2.1 Motivation and Basic Idea

      Proteins play an important role in the function of complex biological systems. However, the relationship between the primary sequences, three-dimensional structures and functions of proteins remains one of the most important unanswered questions in biology. With the completion of the Human Genome Project and the many efforts to determine biological sequences accurately, a large number of genomic and proteomic sequences are now available for different organisms. The exponential growth of these data provides an opportunity to attack the sequence–structure–function mapping problem with sophisticated data-driven methods. Such methods have been used successfully in natural language processing, and there are analogies between biological sequences and natural language. In linguistics, certain words and phrases combine to form a meaningful sentence; in biology, specific arrangements of nucleotides denote genes, and specific protein sequences determine the structure and function of the protein [1]. But is there a “language” in biological sequences?

      Mantegna [2] analyzed the linguistic features of noncoding DNA and argued that a “language” exists in noncoding DNA. Although that work has some shortcomings [3–5], many methods from natural language processing have been applied to biological sequences. N-grams of DNA [6] and protein [7] sequences have been extracted. A bio-dictionary has been built and used to annotate proteins [8]. Latent semantic analysis has been used to characterize the secondary structure of proteins [9]. Probabilistic models from speech recognition have been used to improve protein domain discovery [10].

      N-gram analysis is one of the most frequently used techniques in computational linguistics. It assumes that only the previous n − 1 words in a sentence affect the probability of the next word [11]. It has been used successfully in automatic speech recognition, document classification, information extraction, statistical machine translation and other challenging natural language tasks. In this chapter, n-grams are extracted from whole-genome protein sequences, their agreement with Zipf’s law is analyzed, and some statistical features are derived from them.
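The n-gram assumption can be illustrated with a short sketch. It estimates the probability of the next symbol given the previous n − 1 symbols by simple counting (a maximum-likelihood estimate without smoothing). The function name `ngram_conditional_probs` and the toy sequence are hypothetical and chosen only for illustration; each amino-acid letter is treated as one symbol.

```python
from collections import Counter, defaultdict


def ngram_conditional_probs(sequence, n):
    """Estimate P(next symbol | previous n-1 symbols) by counting.

    Hypothetical helper for illustration: a plain maximum-likelihood
    estimate, with no smoothing for unseen contexts.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # context (n-1 symbols) -> Counter of the symbols that follow it
    context_counts = defaultdict(Counter)
    for i in range(len(sequence) - n + 1):
        context = sequence[i:i + n - 1]
        next_symbol = sequence[i + n - 1]
        context_counts[context][next_symbol] += 1
    probs = {}
    for context, counter in context_counts.items():
        total = sum(counter.values())
        probs[context] = {sym: c / total for sym, c in counter.items()}
    return probs


# Toy sequence (not real data): "VL" is followed once by "A" and once by "G".
toy = "MKVLAMKVLG"
probs = ngram_conditional_probs(toy, 3)
print(probs["VL"])
print(probs["MK"])
```

With n = 3 the context is the two preceding symbols, so the estimate for each context depends only on how often each symbol follows that pair in the data.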

      Amino acids are treated as words, since each amino acid carries a chemical “meaning”. To extract n-grams from whole-genome protein sequences, all the proteins of the same organism were concatenated in series, separated by blanks, e.g. protein1 protein2 protein3, and so on. Because the genomic data are large, a suffix array [12,13] was used to reduce the computational cost. To extract the n-gram statistics, we developed a toolkit that provides the following functions:

      1. Count protein number and length.

      2. Count n-grams and most frequent n-grams.

      3. Count n-grams of specified length.

      4. Determine relative frequencies of specific n-grams across organisms.

      5. Assess the distribution of n-gram frequencies in a specific organism.
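The five functions listed above can be sketched roughly as follows. This is a minimal illustration, not the toolkit itself: it swaps in a hash-based `Counter` in place of the suffix array, which is simpler but less memory-efficient on whole genomes. All function names and the toy sequences are hypothetical. Proteins are assumed to be given as one blank-separated string per organism, and n-grams are not allowed to span two proteins.

```python
from collections import Counter


def read_proteome(text):
    """Split a blank-separated string of proteins into a list."""
    return text.split()


def protein_stats(proteins):
    """Function 1: number of proteins and the length of each."""
    return len(proteins), [len(p) for p in proteins]


def count_ngrams(proteins, n):
    """Functions 2 and 3: count all n-grams of length n.

    N-grams are counted within each protein, so none crosses the blank
    between two proteins.
    """
    counts = Counter()
    for protein in proteins:
        for i in range(len(protein) - n + 1):
            counts[protein[i:i + n]] += 1
    return counts


def relative_frequency(counts, ngram):
    """Function 4: share of one n-gram among all counted n-grams."""
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0


def frequency_distribution(counts):
    """Function 5: map each frequency to the number of distinct n-grams
    that occur with that frequency."""
    return Counter(counts.values())


# Toy proteomes (not real data), one blank-separated string per organism.
proteins_a = read_proteome("MKVLA MKVLG AAAA")
proteins_b = read_proteome("MKV GGGG")

counts_a = count_ngrams(proteins_a, 2)
counts_b = count_ngrams(proteins_b, 2)

print(protein_stats(proteins_a))
print(counts_a.most_common(3))  # most frequent 2-grams, descending
print(relative_frequency(counts_a, "MK"), relative_frequency(counts_b, "MK"))
print(frequency_distribution(counts_a))
```

Comparing `relative_frequency` for the same n-gram across organisms corresponds to function 4, and `most_common` gives the descending-frequency ordering used for ranking n-grams within one organism.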

      The method was applied to protein sequences derived from the whole-genome sequences of 20 organisms. The protein sequence data were downloaded from the Swiss-Prot database [14]. The number of proteins per organism ranges from 484 (Mycoplasma genitalium) to 25,612 (human).

      We developed a modification of Zipf-like analysis that could reveal differences between word usage in different organisms. First, the amino acid n-grams of a given length were sorted in descending order by frequency for the organism of choice. The comparative n-gram plots comparing the n-grams of one organism to those of other organisms