Algorithms in Bioinformatics. Paul A. Gagniuc.

deltocephalinicola with a genome of 112 kbp (0.11 Mbp) [172, 173]. The eukaryotes with the smallest nuclear genome necessary for life are found in the kingdom of fungi. The spore-forming unicellular parasite Encephalitozoon intestinalis shows a genome size of ∼2.3 Mbp and a total of 1.8k protein-coding genes [174]. Nonetheless, the smallest free-living eukaryote is Ostreococcus tauri, a marine green alga with a diameter of about 0.8 μm and a genome size of 12.6 Mbp (8.2k protein-coding genes) [175].

2.3.1 Alternative Methods

The data mentioned above were obtained from the DNA sequencing efforts made so far. DNA sequencing has been an ongoing process for several decades, and the species chosen for sequencing are usually of economic, research, or even historical importance. Many species have not yet been sequenced, either because they are of minor importance to humans or because their genomes are too large to be managed easily. In such cases, the size of the genetic material can usually be estimated by methods other than sequencing. One of these methods is flow cytometry, which estimates the weight of the genetic material [176]. This weight, expressed in picograms (pg), can then be converted to base pairs: one picogram corresponds to 978 megabase pairs (1 pg = 978 Mbp) [177]. For instance, Paris japonica (a flowering plant) shows a genome weight of 152.23 pg, which suggests a genome size of about 148 880 Mbp (152.23 pg × 978 Mbp/pg ≈ 149 Gbp) [178].
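The picogram-to-base-pair conversion can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the paragraph above; the names `MBP_PER_PG` and `pg_to_mbp` are assumptions introduced here and do not come from the book.

```python
# Conversion factor quoted in the text: 1 pg = 978 Mbp.
MBP_PER_PG = 978


def pg_to_mbp(mass_pg):
    """Convert a genome weight in picograms to a genome size in Mbp."""
    return mass_pg * MBP_PER_PG


# Paris japonica, using the 152.23 pg value given in the text.
size_mbp = pg_to_mbp(152.23)
print(f"{size_mbp:.2f} Mbp")          # size in megabase pairs
print(f"{size_mbp / 1000:.1f} Gbp")   # same value in gigabase pairs
```

The product is 148 880.94 Mbp, which the text reports as 148 880 Mbp, or roughly 149 Gbp.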

2.3.2 The Weaving of Scales

To get a sense of genome size closer to our own frame of reference, megabase pairs can be converted to physical lengths. The linear length of a double-stranded DNA (dsDNA) molecule can be calculated by multiplying the average distance between bases (∼3.4 angstrom = 0.34 nm [179, 180]; 1 angstrom = 0.1 nm) by the total number of base pairs in a genome. Here, genome sizes are expressed in megabase pairs. Since 1 Mbp is equal to one million base pairs, the size of a genome in Mbp is first multiplied by one million and then by the average distance between bases (0.34 nm). One meter is equal to 1 000 000 000 nanometers (1 × 10⁹). Thus, the result expressed in nanometers is divided by 1 × 10⁹ for conversion to meters.
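The three steps described above (Mbp to base pairs, base pairs to nanometers, nanometers to meters) can be written out as a minimal Python sketch. The constant and function names are assumptions chosen for readability; only the numeric values come from the text.

```python
NM_PER_BP = 0.34            # average distance between bases, in nm (from the text)
BP_PER_MBP = 1_000_000      # 1 Mbp = one million base pairs
NM_PER_M = 1_000_000_000    # 1 m = 1 x 10^9 nm


def genome_length_m(size_mbp):
    """Linear length in meters of a dsDNA genome whose size is given in Mbp."""
    base_pairs = size_mbp * BP_PER_MBP   # Mbp -> bp
    length_nm = base_pairs * NM_PER_BP   # bp -> nm
    return length_nm / NM_PER_M          # nm -> m


# Ostreococcus tauri, 12.6 Mbp (value taken from the text).
print(f"{genome_length_m(12.6):.6f} m")
```

Keeping each unit conversion on its own line makes it easy to check the calculation against the prose, at the cost of a slightly longer function.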

Depending on the organism, cells of different tissues can be characterized by the number of sets of chromosomes present: monoploid (one set of chromosomes), diploid (two sets), triploid (three sets), tetraploid (four sets), pentaploid (five sets), and so on. For instance, the human genome contains 3.1 Gbp (3100 Mbp). Thus, in a human haploid (or monoploid) cell (e.g. a single set of chromosomes found in a gamete), the unfolded length of a single set of chromosomes, arranged linearly one after the other, would show an approximate length of:

(3100 × 1 000 000 bp × 0.34 nm) / (1 × 10⁹) = 1.054 m

Thus, a single set of human chromosomes (n = 23 Chr) can theoretically unfold to about 1 m. However, the human body is constituted mainly of somatic cells, which are diploid (two sets of chromosomes per cell). For a diploid cell (2n = 46 Chr), the linear length of all 46 dsDNA molecules is calculated as above and the result is multiplied by two:

2 × (3100 × 1 000 000 bp × 0.34 nm) / (1 × 10⁹) = 2.108 m

Therefore, the two sets (2n = 46 Chr) of human chromosomes found inside a somatic cell can theoretically unfold to about 2.1 m. The linear length of the dsDNA molecules from all chromosomes of a somatic cell, together with the estimated average number of somatic cells in the human body, can be used for various thought experiments (e.g. comparisons between DNA lengths and cosmic distances). These calculations can also be extended to ssDNA molecules placed linearly one after the other. For instance, the 2.1 m of dsDNA from a somatic cell doubles if the two strands are counted separately (2.1 m × 2 DNA strands = 4.2 m of ssDNA). The implementation found in Additional algorithm 2.1 uses the above formula to convert the number of bases of a genome to a physical length expressed in meters. Important: For convenience, from this point on all notations “b”, “kb”, “Mb”, “Gb” will refer to dsDNA (double-stranded DNA).
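The haploid, diploid, and ssDNA figures discussed above can be reproduced with a short Python sketch. It is an illustration written for this passage, not the book's listing; the function name `unfolded_length_m` and its parameters `chromosome_sets` and `strands` are assumptions.

```python
def unfolded_length_m(size_mbp, chromosome_sets=1, strands=1):
    """Length in meters of DNA laid end to end.

    size_mbp        -- size of one chromosome set, in Mbp
    chromosome_sets -- 1 for haploid, 2 for diploid, and so on
    strands         -- 1 to measure the double helix once (dsDNA),
                       2 to count both strands separately (ssDNA)
    """
    # Mbp -> bp (x 1e6), bp -> nm (x 0.34), nm -> m (/ 1e9)
    one_set_m = size_mbp * 1e6 * 0.34 / 1e9
    return one_set_m * chromosome_sets * strands


human_mbp = 3100  # human genome size used in the text
print(f"haploid dsDNA: {unfolded_length_m(human_mbp):.2f} m")
print(f"diploid dsDNA: {unfolded_length_m(human_mbp, chromosome_sets=2):.2f} m")
print(f"diploid ssDNA: {unfolded_length_m(human_mbp, 2, 2):.2f} m")
```

The unrounded values are 1.054 m, 2.108 m, and 4.216 m. The text rounds the diploid length to 2.1 m before doubling it, which is why it reports 4.2 m for the ssDNA case.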

Additional algorithm 2.1 Note that the source code is in context and works with copy/paste.

Above, the example is given on Homo sapiens and the result shows the calculated total length of unfolded chromosomes for both haploid cells and diploid (somatic) cells. This computation can be applied to all genomes mentioned so far by calling function f repeatedly. Thus, Additional algorithm 2.1 is extended to perform this calculation for an arbitrary number of species (Additional algorithm 2.2).

Additional algorithm 2.2 Note that the source code is in context and works with copy/paste.

To call function f repeatedly, a parsing-based method is used. Above, variable a contains a series of records. The structure of these records is based on two delimiters, namely “|” and “Mb”. The “|” delimiter separates the species name ( r[0] ) from the size of the genome ( r[1] ), while the “Mb” delimiter separates the records from each other ( t[u] ). Please note that 0.001 m equals 1 mm. For instance, the output of Additional algorithm 2.2 shows that the Escherichia coli genome is ∼1.6 mm in length (0.0016 m), and that the E. intestinalis genome is 0.78 mm in length (0.00078 m).
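The two-delimiter parsing scheme can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the book's Additional algorithm 2.2: the record string is hypothetical, the names `records`, `parse_records`, and `genome_length_m` are introduced here, and the 4.6 Mbp size for Escherichia coli is an assumed value consistent with the ∼1.6 mm quoted above. The other sizes appear in the text.

```python
def genome_length_m(size_mbp):
    """Length in meters of a dsDNA genome given in Mbp (0.34 nm per base pair)."""
    return size_mbp * 1e6 * 0.34 / 1e9


# Hypothetical record string: "|" splits name from size, "Mb" ends each record.
records = ("Escherichia coli|4.6Mb"          # assumed size, see lead-in
           "Encephalitozoon intestinalis|2.3Mb"
           "Ostreococcus tauri|12.6Mb"
           "Homo sapiens|3100Mb")


def parse_records(text):
    """Return a list of (species, size_mbp) tuples from a delimited string."""
    genomes = []
    for record in text.split("Mb"):      # "Mb" separates the records
        record = record.strip()
        if not record:                   # the final "Mb" leaves an empty piece
            continue
        name, size = record.split("|")   # "|" separates name from size
        genomes.append((name.strip(), float(size)))
    return genomes


for name, size_mbp in parse_records(records):
    length_m = genome_length_m(size_mbp)
    print(f"{name}: {length_m:.5f} m ({length_m * 1000:.2f} mm)")
```

With these inputs, the E. coli and E. intestinalis genomes come out at about 1.56 mm and 0.78 mm. Splitting on “Mb” is simple but fragile: a species name that itself contains “Mb” would be cut in two, so a format with unambiguous delimiters would be safer for real data.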

2.3.3 Computations on the Average Genome Size

A series of computations show the average genome size observed for each division in the tree of life, as well as the average size of viral genomes and the average DNA length of plasmids (Figure 2.1 and Table 2.1). These values were calculated from the raw data extracted from the file transfer protocol (FTP) of the National Center for Biotechnology Information (NCBI). The NCBI section for Genome Information by Organism contains general data in relation to each branch from the tree of life: eukaryotes (13k); prokaryotes (265k); viruses
