this information, so the language of nucleic acids alone is not sufficient to describe life activity in its entirety. Studying the proteome, both as a whole and as it changes dynamically, is therefore a demanding task and an indispensable follow-up to genomic research for elucidating the nature of life activities. Post-genome research, that is, proteome research, will undoubtedly take over from genome research as a main task of the life sciences in the 21st century.
The mapping between a biological sequence and its structure and function is similar to the mapping between words and their meanings in a language.2 In linguistics, words are arranged into meaningful sentences; in biology, the arrangement of amino acids encodes the structure and function of a protein. The arrangement of amino acids that forms a protein can thus be regarded as a meaningful arrangement of words, one that gives rise to the protein's specific structure and function. The words in a document map directly to its semantics and carry information about the topic of the article; similarly, the protein sequence can be regarded as the original text, containing information about structure and function that can be used to further understand the interactions between proteins.
As protein primary structure sequencing technology matures, the amount of genomic and proteomic sequence data continues to increase, as does the associated structural and functional data. These data will increase exponentially over the next decade, making it possible to use a data-driven approach to solve protein sequence–structure–function mapping problems. Data-driven methods have been successfully applied in many areas of natural language processing, such as speech recognition, text categorization, information extraction and machine translation.3
The emergence of a large number of corpora has promoted the development of computational linguistics. Similarly, the emergence of a large amount of protein sequence–structure–function data has enabled computational methods and information techniques to be applied in this field. Computational linguistic tools including statistical language models, text classification techniques, machine learning methods and higher-level language processing methods have been applied to understand the structure and function of proteins in cells. The purpose of this book is to introduce relevant techniques of biological language modeling in bioinformatics and promote the development of protein sequence–structure–function mapping.
1.2 Related Topics
1.2.1 Linguistic feature analysis of protein sequences
Protein sequences are similar to the sentences of natural language, as both are made up of linear arrangements of basic units. The mapping of sequences to the structures and functions of proteins is conceptually similar to the mapping of words to meanings. This analogy has been explored in a growing body of research,4 but do protein sequences actually exhibit linguistic features? And what are the basic units of the protein sequence language?
1.2.2 Amino acid encoding for protein sequence
In general, protein sequences are represented as strings over the twenty-letter amino acid alphabet. Because machine learning methods cannot process such strings directly, the strings must first be converted to a numerical representation. Obtaining a numerical representation for each amino acid5,6 is therefore the first step of machine-learning-based protein structure and function prediction methods, and an effective representation7 is crucial to the final success of these methods.
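One simple numerical representation is binary (one-hot) encoding, in which each residue becomes a 20-dimensional indicator vector. The sketch below is illustrative only: the names `AMINO_ACIDS`, `INDEX` and `one_hot_encode` are hypothetical, and the choice to map unknown symbols to an all-zero vector is an assumption rather than a convention taken from the methods reviewed in this book.

```python
# Illustrative sketch; all names here are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 vector per residue of `sequence`."""
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        # Assumption: symbols outside the 20-letter alphabet (e.g. X)
        # are left as an all-zero vector instead of raising an error.
        if residue in INDEX:
            vector[INDEX[residue]] = 1
        encoded.append(vector)
    return encoded
```

A sequence of length L thus becomes an L x 20 matrix. This encoding treats all residues as equally dissimilar, which is one reason the richer encodings discussed later in the book are of interest.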
1.2.3 Remote homology detection
With the rapid growth in the number of completely sequenced genomes, a large amount of sequence data has been deposited in databases, and the structures and functions of these sequences now need to be elucidated. In general, the easiest way to annotate newly sequenced proteins is to transfer annotations from well-characterized homologous proteins.8 Therefore, the development of novel algorithms for protein homology detection is of great importance.9,10 This is especially so since remote homology detection, the detection of homologous relationships at low sequence identity, remains a challenging problem in computational biology.11,12
1.2.4 Structure prediction
With the success of a series of genome-sequencing projects, the number of known protein sequences has grown exponentially. The amount of sequence data in current molecular databases far exceeds the amount of structural data, yet structural information is essential for revealing the biological function of proteins. Because structural biology experiments are technically difficult and laborious, the pace of protein structure determination lags far behind the growth in the number of sequences. Protein structure prediction13 therefore has great theoretical and practical value. In theory, it helps us understand systematically and completely the whole process by which biological information is transferred from DNA to biologically active proteins, and it helps clarify the central dogma more fully.14 A deeper understanding of the various phenomena of the life process ultimately promotes the rapid development of the life sciences.15 In application, it helps us analyze disease pathogenesis, find treatments and design proteins with novel biological functions, thereby promoting the rapid development of medicine, agriculture and animal husbandry. Developing efficient computational algorithms to predict high-resolution 3D protein structures from their sequences is thus becoming increasingly important.16,17
1.2.5 Function prediction
Proteins are among the most important molecules in biology, as they play a role in many life processes, such as transcription, metabolism and regulation. Functional analysis of proteins is thus of great importance for understanding the processes of life.18 Because the number of proteins is so large, it is difficult to verify the function of every protein experimentally. Computational approaches to function prediction are therefore needed to assist in the functional identification of the proteome.19 Related research9 includes interaction prediction and ontology-based function prediction. Since proteins perform their functions by binding to ligands, including other proteins, metal ions, DNA and RNA, predicting the binding sites of proteins is essential for exploring protein function in detail.20
1.3 Organization of the Book Content
This book is organized as follows. It begins with an introduction to the proteome, the biological language model and its applications. It then presents several research topics in biological language modeling, each with a detailed introduction to the background and a description of the methods: linguistic feature analysis of protein sequences, amino acid encoding for protein sequences, protein remote homology detection, protein structure prediction and protein function prediction. For linguistic feature analysis, the n-grams of whole-genome protein sequences from 20 organisms were extracted to obtain statistical results across the large number of genomic and proteomic sequences available for different organisms. Their linguistic features were analyzed with two tests developed for natural languages and symbolic sequences, Zipf’s power law and Shannon’s entropy. For amino acid encoding, a comprehensive review of the available methods is presented, and the methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding and machine-learning encoding. For protein remote homology detection, latent semantic analysis is used to extract and represent the contextual-usage meaning of the words of protein sequences by statistical computation, and the auto-cross covariance transformation is introduced to transform protein sequences into fixed-length vectors.
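As a rough illustration of the n-gram statistics mentioned above, the following sketch counts overlapping n-grams in a set of sequences, computes the Shannon entropy of the resulting distribution, and estimates a Zipf exponent. The function names are hypothetical, and fitting a least-squares line to log frequency against log rank is an assumed, simplified way of checking for Zipf-like behaviour; it is not necessarily the procedure used in the later chapters.

```python
import math
from collections import Counter


def ngram_counts(sequences, n):
    """Count overlapping n-grams over an iterable of sequence strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a slope near -1 is read as Zipf-like behaviour.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

In use, one would call `ngram_counts` on the protein sequences of a proteome for a chosen n, then pass the resulting counter to `shannon_entropy` and `zipf_slope` to compare organisms or n-gram lengths.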
For protein structure prediction, a novel profile-level index is presented for protein domain linker prediction, a building-block library-based method is presented to predict the local structures and folding fragments of proteins, conformational entropy is used as an indicator of protein flexibility, and a class of novel nonlinear knowledge-based mean force potentials is presented. For protein function prediction, profile-level interface propensities are used for binding site prediction, sequence