



Conference Paper

The SYSTERS protein family database: taxon-related protein family size distributions and singleton frequencies


Meinel, Thomas
Max Planck Society

Vingron, Martin
Gene regulation (Martin Vingron), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society

Krause, Antje
Max Planck Society


Meinel, T., Vingron, M., & Krause, A. (2003). The SYSTERS protein family database: taxon-related protein family size distributions and singleton frequencies. In H.-W. Mewes, D. Frishman, V. Heun, & S. Kramer (Eds.), Proceedings of the German Conference on Bioinformatics (GCB '03) (pp. 103-108).

Cite as: https://hdl.handle.net/11858/00-001M-0000-0010-8B2A-A
Based on the SYSTERS protein family database, we present taxon-related protein family frequencies and size distributions. A set of taxon-related protein families is a subset of the whole family set with respect to one taxon, where a taxon is not restricted to the species level but may be of any rank in the taxonomy. We examine eight ranks in the lineages of seven organisms. We observe a strong linear correlation between the total number of distinct families and the number of sequences in the data set under consideration. We fitted the generalised power-law function to the protein family size distributions in a least-squares sense, excluding singleton frequencies. Taxon-related family distributions tend to have the same shape, with a negative slope no larger than -2.1 for large data sets. For smaller data sets, the slope decreases down to -3.7. Slopes of family distributions increase slowly towards higher taxonomic ranks. These observations lead to a new estimate of the frequencies of single-sequence clusters. We also examine whether the data sets of various species are complete or incomplete.
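
To make the fitting step concrete, here is a minimal sketch in Python of a least-squares fit to a family size distribution with singletons excluded. It is an illustration under stated assumptions, not the authors' procedure: it fits the simple power law f(n) = c * n^slope in log-log space rather than the generalised power-law function named in the abstract, and the function name `fit_power_law`, the `min_size` parameter, and the synthetic input are all hypothetical.

```python
import math


def fit_power_law(distribution, min_size=2):
    """Fit f(n) = c * n**slope by ordinary least squares in log-log space.

    `distribution` maps a family size n to the number of families of that
    size. Sizes below `min_size` are ignored, so the default of 2 leaves
    singletons (n = 1) out of the fit. Assumption: a simple power law,
    not the generalised form used in the paper.
    """
    points = [
        (math.log(n), math.log(f))
        for n, f in distribution.items()
        if n >= min_size and f > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-empty size classes to fit")

    k = len(points)
    mean_x = sum(x for x, _ in points) / k
    mean_y = sum(y for _, y in points) / k
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    c = math.exp(mean_y - slope * mean_x)
    return slope, c


if __name__ == "__main__":
    # Synthetic distribution (not data from the paper): exact power law
    # with exponent -2.5 over family sizes 1..50.
    synthetic = {n: 10000.0 * n ** -2.5 for n in range(1, 51)}
    slope, c = fit_power_law(synthetic)
    print(f"slope = {slope:.3f}, c = {c:.1f}")
```

Because f(1) = c for this functional form, the fitted prefactor can be read as an extrapolated singleton count. That reading is shown only to illustrate why singletons can be excluded from the fit and then compared against it; the estimator the paper actually derives may differ.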