Qiwen Dong

Biological Language Model



and function.

      Qiwen Dong

      Xiuzhen Hu

      Xiaoyang Jing

      Aoying Zhou

       Acknowledgments

      This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000905 and by the National Natural Science Foundation of China (Grant Nos. U1401256, U1711262, U1811264, 61672234, 61961032, 31260203 and 61402177).

      We would like to thank everyone who contributed to this book and offered valuable suggestions, especially Bin Liu, Ming Gao, Dingjiang Huang and Daocheng Hong. We would also like to express our sincere thanks and appreciation to the people at University Press for their generous help throughout the preparation of this publication.

       Contents

       East China Normal University Scientific Reports

       Preface

       Acknowledgments

       1. Introduction

       1.1 Background and Motivation

       1.2 Related Topics

       1.3 Organization of the Book Content

       References

       2. Linguistic Feature Analysis of Protein Sequences

       2.1 Motivation and Basic Idea

       2.2 Comparative n-gram Analysis

       2.3 The Zipf Law Analysis

       2.4 Distinguishing the Organisms by Uni-Gram Model

       2.5 Conclusions

       References

       3. Amino Acid Encoding for Protein Sequence

       3.1 Motivation and Basic Idea

       3.2 Related Work

       3.3 Discussion

       3.4 The Assessment of Encoding Methods for Protein Secondary Structure Prediction

       3.5 Assessments of Encoding Methods for Protein Fold Recognition

       3.6 Conclusions

       References

       4. Remote Homology Detection

       4.1 Motivation and Basic Idea

       4.2 Related Work

       4.3 Latent Semantic Analysis

       4.4 Auto-cross Covariance Transformation

       4.5 Conclusions

       References

       5. Structure Prediction

       5.1 Motivation and Basic Idea

       5.2 Related Work

       5.3 Domain Boundary Prediction

       5.4 Building Blocks of Protein Local Structure

       5.5 Characterization of Protein Flexibility Based on Structural Alphabets

       5.6 Novel Nonlinear Knowledge-based Mean Force Potentials

       5.7 Conclusions

       References

       6. Function Prediction

       6.1 Motivation and Basic Idea

       6.2 Profile-level Interface Propensities for Binding Site Prediction

       6.3 Gene Ontology-Based Protein Function Prediction

       6.4 Prediction of Protein–Protein Interaction from Primary Sequences

       6.5 Identifying the Missing Proteins using the Biological Language Model

       6.6 Conclusions

       References

       7. Summary and Future Perspectives

       Index

       Chapter 1

       Introduction

       1.1 Background and Motivation

      The sequencing of the human genome was completed in 2003, and the life sciences then entered the post-genome era. The focus of research is gradually shifting from accumulating data to interpreting it, i.e., to extracting structural and functional information from sequence data. Post-genome research includes comparative genomics, structural genomics, functional genomics, proteomics, holistic biology and pharmacogenomics.

      The proteome¹ is a dynamic concept: it differs not only among the tissues and cells of a single organism but also changes constantly across that organism's developmental stages until its death. Complex patterns of gene expression give rise to a variety of complex life activities. In fact, each form of activity in the stages of life is the result of different combinations of specific groups of proteins that appear at different times and places. The sequence of the