Sensitive clustering of 20 billion protein sequences at tree-of-life scale using DIAMOND2 DeepClust

Drost, H-G

MPS-Authors

Drost, H-G
Computational Biology Group, Department Molecular Biology, Max Planck Institute for Biology Tübingen, Max Planck Society
Department Molecular Biology, Max Planck Institute for Biology Tübingen, Max Planck Society

Citation

Drost, H.-G. (2023). Sensitive clustering of 20 billion protein sequences at tree-of-life scale using DIAMOND2 DeepClust. Talk presented at Max-Planck-Campus Tübingen: Distinguished Speaker Seminar Series (DSSS). Tübingen, Germany. 2023-06-02.

Cite as: https://hdl.handle.net/21.11116/0000-000D-37D8-2

Abstract

Our understanding of the origin and natural variation of the global biosphere is largely derived from morphological insights, with data collections reaching back to the time of Aristotle. Sequencing the genomes and annotating the protein sequences across the tree of life will transform our access to evolutionary information and may provide a roadmap to characterizing the molecular principles underlying biodiversification. The key to accessing this reservoir of genomic information for molecular exploration and functional annotation is the comparative method, usually enabled by sequence similarity assessments. We introduce DIAMOND2 DeepClust, an ultra-fast and sensitive sequence clustering method optimized for protein sequence similarity clustering at low identity levels (e.g. down to 20% identity). Using DIAMOND2 DeepClust, we present an experimental study based on clustering the protein universe, currently comprising ~20 billion protein sequences, and show how to overcome computational bottlenecks in the biosphere genomics era.