Group of authors

Systematics and the Exploration of Life



predicted value of ΔΔG from the mutant to the native.

      In addition, protein structures of identical sequences can vary greatly due to different protein–protein interactions or to interactions with different ligands or solvents (Kosloff and Kolodny 2008). However, provided that a sufficient number of structures are available, the effect of the mutation can be distinguished from “noise”, in other words, from statistical fluctuations due to other sources, such as exposure to solvents or membership in a secondary structure element (Shanthirabalan et al. 2018). This requires measuring the variations between the native and mutated proteins while considering both the global variability and the local flexibility of the structure.

Schematic illustration of procedure for calculating local RMSDs.

      COMMENT ON FIGURE 2.6. a) Superimposition of two lysozyme structures. b) Their difference is measured by the RMSD calculated for fragments of three successive residues, such as the fragment in blue. The RMSD is the square root of the mean of the squared distances D between the N alpha carbon pairs ai and b’i (i.e. bi after superimposition), where N is the number of Cα pairs (three in our case). This calculation is repeated along the whole protein, so there are as many RMSDs as residues in the protein (except for the two Cαs at the extremities), and a profile is obtained (graph below). The cross marks the position of the mutation (protein 2hef chain A, mutation I89A).
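      The local RMSD profile described in the figure comment above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `local_rmsd_profile`, the input format (lists of Cα coordinates as `(x, y, z)` tuples) and the toy coordinates are assumptions, and the two structures are assumed to have been superimposed beforehand.

```python
import math


def local_rmsd_profile(coords_a, coords_b, window=3):
    """Return one RMSD per fragment of `window` successive C-alpha atoms.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples.
    Assumption: coords_b has already been superimposed onto coords_a.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of C-alpha atoms")
    if len(coords_a) < window:
        raise ValueError("structure is shorter than the window")

    # Squared distance D^2 for every C-alpha pair (a_i, b'_i).
    squared = [
        sum((p - q) ** 2 for p, q in zip(a, b))
        for a, b in zip(coords_a, coords_b)
    ]

    # Slide the window along the chain: RMSD = sqrt(mean of D^2).
    profile = []
    for start in range(len(squared) - window + 1):
        fragment = squared[start:start + window]
        profile.append(math.sqrt(sum(fragment) / window))
    return profile


# Toy example (made-up coordinates): only the third C-alpha is displaced.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 2.0), (3.0, 0.0, 0.0)]
profile = local_rmsd_profile(a, b)
print(profile)
```

      With a window of three residues, a chain of L residues yields L − 2 values, which matches the remark that the two Cαs at the extremities have no RMSD of their own.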

Bar chart depicts the RMSD calculated for 78 mutants of a transferase, the reference being that of Pyrococcus horikoshii (PDB code 2dek chain A). Bar charts depict the distribution of RMSD, p-values and p-ranks.

      In order to take into account the global variability due to variations in experimental conditions, the raw RMSD should not be used directly, but rather a transformation of it. Replacing values by their ranks is a robust transformation used in many statistical tests. The RMSDs in each profile are first ranked in ascending order, and the ranks are then divided by the number of RMSDs in the profile (in other words, the length of the chain). The result is a set of dimensionless values between 0 and 1, the empirical p-values, that allow the characterization of each protein in the family. If the mutations had no particular effect on the RMSD, the distribution of these p-values should be uniform, which is not what is observed (Figure 2.8(b)). This first transformation accounts for the experimental variability, but not for the intrinsic flexibility of the molecule. Indeed, in very flexible regions the RMSD is always large, so its rank in the profile, and thus the empirical p-value, will also always be large. It is therefore necessary to perform a second ranking, this time of the empirical p-values, at each position within each family. The resulting value is called the p-rank in order to distinguish it from the first empirical p-value.
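      The two ranking steps above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names, the toy RMSD values, the assumption that all profiles in a family are aligned position by position, and the use of mid-ranks for ties (the text does not say how ties are handled) are all assumptions.

```python
def rank_fraction(values):
    """Ascending rank of each value divided by the number of values.

    Ties receive the average of the ranks they span (an assumption).
    """
    n = len(values)
    fractions = []
    for v in values:
        less = sum(1 for other in values if other < v)
        equal = sum(1 for other in values if other == v)
        fractions.append((less + (equal + 1) / 2) / n)
    return fractions


def empirical_p_values(profile):
    """First ranking: rank each RMSD within its own profile."""
    return rank_fraction(profile)


def p_ranks(family_p_values):
    """Second ranking: at each position, rank the empirical p-values
    across all proteins of the family (profiles assumed aligned)."""
    n_positions = len(family_p_values[0])
    columns = [
        rank_fraction([protein[i] for protein in family_p_values])
        for i in range(n_positions)
    ]
    # Transpose back so that result[k][i] is protein k at position i.
    return [
        [columns[i][k] for i in range(n_positions)]
        for k in range(len(family_p_values))
    ]


# Toy family of three RMSD profiles (made-up values). Position 1 is
# flexible in every protein; position 0 is disturbed in the last two.
family = [
    [0.2, 1.5, 0.4, 0.3],
    [0.3, 1.2, 0.2, 0.25],
    [0.9, 1.6, 0.3, 0.2],
]
p_values = [empirical_p_values(profile) for profile in family]
ranks = p_ranks(p_values)
print(p_values)
print(ranks)
```

      In this toy family, the flexible position 1 has the largest empirical p-value in every profile, yet its p-rank is the same for all three proteins, so it no longer stands out; this is the effect the second ranking is meant to achieve.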

      The mutation site is the position most likely to be disturbed, at least in intensity (with a high RMSD value). Among all the calculated RMSDs, 580 positions correspond to a mutation. The p-rank distribution of RMSDs centered on mutated residues is likewise not uniform (Figure 2.8(b)). Among the top 5% of the largest RMSDs, 12% are mutation-centered; among the top 5% of empirical p-ranks, 15% are mutation-centered;