Abstract:
The genesis of a structured and functional protein by random processes is exceedingly unlikely. However, once a functioning protein emerges, it can readily become established [1]. The evolution of natural proteins therefore often proceeds through the amplification of already established protein sequences. Copies of the same sequence evolve over time, leading to the coexistence of similar sequences that may also have diversified in function [2]. We investigate the prevalence of such conservative evolution by analyzing reuse in the protein sequence universe. As a representative sample for this study, we chose 1300 non-redundant bacterial genomes from distinct genera, with exemplars from most bacterial classes. We use statistical modeling to distinguish sequence similarities that arise through reuse from those that arise by mere chance. For this purpose, we derive the distribution of point mutation distances between randomly drawn k-mers. For long point mutation distances, the distribution can be described by a binomial distribution based on the amino acid composition of the underlying data. The frequency of shorter distances is significantly higher than the binomial distribution predicts, and this excess can be explained by reuse. In the example of 100-mers, we find that most sequence fragments (>90%) are reused at least once (p-value of 10^-5). More than 10% of all sequence fragments are extensively reused and recur more than a thousand times. Pairwise genome comparisons reveal an average overlap of around 19% in shared sequences. This demonstrates that the pressure to conserve sequences is strong enough to cause such significant sequence overlap, even after billions of years have passed.
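
The chance-only model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "point mutation distance" means the Hamming distance between two equal-length k-mers, and that residues are drawn independently from the overall amino acid composition, so that the distance follows a binomial distribution with mismatch probability 1 minus the sum of squared residue frequencies. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def match_probability(sequences):
    """Probability that two independently drawn residues are identical,
    estimated from the pooled amino acid composition (assumed null model)."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers
    (used here as a stand-in for the point mutation distance)."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def null_distance_pmf(k, p_match):
    """Binomial probabilities of observing distance d = 0..k between two
    random k-mers when each position matches with probability p_match."""
    q = 1.0 - p_match
    return [comb(k, d) * q ** d * p_match ** (k - d) for d in range(k + 1)]


def tail_probability(k, p_match, d_max):
    """Chance probability that two random k-mers differ at d_max or fewer
    positions; small values suggest similarity beyond chance."""
    return sum(null_distance_pmf(k, p_match)[: d_max + 1])
```

Under these assumptions, an observed pair of k-mers whose distance falls in a region where `tail_probability` is very small would be a candidate for reuse rather than chance similarity. The choice of threshold is left open here.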