Talk

Sensitive clustering of 20 billion protein sequences at tree-of-life scale using DIAMOND2 DeepClust

MPS-Authors

Drost, H.-G.
Computational Biology Group, Department Molecular Biology, Max Planck Institute for Biology Tübingen, Max Planck Society;
Department Molecular Biology, Max Planck Institute for Biology Tübingen, Max Planck Society;

Citation

Drost, H.-G. (2023). Sensitive clustering of 20 billion protein sequences at tree-of-life scale using DIAMOND2 DeepClust. Talk presented at Max-Planck-Campus Tübingen: Distinguished Speaker Seminar Series (DSSS). Tübingen, Germany. 2023-06-02.


Cite as: https://hdl.handle.net/21.11116/0000-000D-37D8-2
Abstract
Our understanding of the origin and natural variation of the global biosphere is largely derived from morphological insights, with data collections reaching back to the time of Aristotle. Sequencing the genomes and annotating the protein sequences across the tree of life will transform our access to evolutionary information and may provide a roadmap to characterizing the molecular principles underlying biodiversification. The key to accessing this reservoir of genomic information for molecular exploration and functional annotation is the comparative method, usually enabled by sequence similarity assessments. We introduce DIAMOND2 DeepClust, an ultra-fast and sensitive sequence clustering method optimized to perform protein sequence similarity clustering at low identity levels (e.g. down to 20% identity). Using DIAMOND2 DeepClust, we present an experimental study based on clustering the protein universe, currently comprising ~20 billion protein sequences, and show how to overcome computational bottlenecks in the biosphere genomics era.