MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

Hauser, M.; Steinegger, M.; Söding, J.

doi:10.1093/bioinformatics/btw006

Hauser, M., Steinegger, M., & Söding, J. (2016). MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics, 32(9), 1323-1330. doi:10.1093/bioinformatics/btw006.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-0029-4BB2-2
Version Permalink: https://hdl.handle.net/11858/00-001M-0000-002C-B89D-F

Genre: Journal Article

Files

2241146.pdf (Publisher version), 633KB

File Permalink:
-

Name:
2241146.pdf

Description:
-

OA-Status:

Visibility:
Restricted

MIME-Type / Checksum:
application/pdf

Technical Metadata:

Copyright Date:
-

Copyright Info:
-

License:
-

2241146-Suppl.pdf (Supplementary material), 2MB

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-002C-7C4E-A

Name:
2241146-Suppl.pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Technical Metadata:

Copyright Date:
-

Copyright Info:
-

License:
-

Creators

Creators:
Hauser, M., Author
Steinegger, M.¹, Author
Söding, J.¹, Author

Affiliations:
¹ Research Group of Computational Biology, MPI for Biophysical Chemistry, Max Planck Society, ou_1933286

Content

Free keywords: -

Abstract: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through comprehensive databases such as UniProt is becoming too costly.

RESULTS: MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4 to 30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ~30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks.
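The prefiltering idea named in the abstract (summing the scores of similar k-mers shared by a query and a target) can be illustrated with a small Python sketch. This is not the MMseqs implementation. The residue groups, the scoring values (+4, +1, -2), the threshold `min_kmer_score`, and all function names are hypothetical choices made for this example; a real tool would use a substitution matrix such as BLOSUM62 and a far more efficient index.

```python
from collections import defaultdict

# Toy residue groups: an assumption for illustration, not a real substitution matrix.
GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def residue_score(a, b):
    """Toy score: +4 for identical residues, +1 for the same group, -2 otherwise."""
    if a == b:
        return 4
    if a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
        return 1
    return -2


def kmer_score(x, y):
    """Sum of per-position residue scores for two k-mers of equal length."""
    return sum(residue_score(a, b) for a, b in zip(x, y))


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(targets, k):
    """Map each k-mer to the set of target names that contain it."""
    index = defaultdict(set)
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def prefilter_scores(query, targets, k=3, min_kmer_score=7):
    """Per target, sum the scores of k-mer pairs scoring at least min_kmer_score."""
    index = build_index(targets, k)
    totals = defaultdict(int)
    for q in kmers(query, k):
        # Brute-force scan over indexed k-mers; fine for a toy, too slow at scale.
        for t_kmer, names in index.items():
            s = kmer_score(q, t_kmer)
            if s >= min_kmer_score:
                for name in names:
                    totals[name] += s
    return dict(totals)


if __name__ == "__main__":
    targets = {"t1": "MKVLA", "t2": "PPPPP", "t3": "MRILA"}
    print(prefilter_scores("MKVLA", targets))
```

In this toy setup, targets that share no sufficiently similar k-mer with the query never appear in the result, so only the remaining candidates would need to be passed to a slower alignment step.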

Details

Language(s): eng - English

Dates: Published Online: 2016-01-06; Date issued: 2016-05-01

Publication Status: Issued

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: Peer

Identifiers: DOI: 10.1093/bioinformatics/btw006

Degree: -

Source 1

Title: Bioinformatics

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: -

Pages: -
Volume / Issue: 32 (9)
Sequence Number: -
Start / End Page: 1323 - 1330
Identifier: -