kClust: Fast and sensitive clustering of large protein sequence databases.

Hauser, M.; Mayer, C. E.; Söding, J.

doi:10.1186/1471-2105-14-248

Hauser, M., Mayer, C. E., & Söding, J. (2013). kClust: Fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics, 14: 248. doi:10.1186/1471-2105-14-248.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-0015-8CCE-6
Version Permalink: https://hdl.handle.net/11858/00-001M-0000-0027-C587-1

Genre: Journal Article

Files

1944212.pdf (Publisher version), 3MB

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-0015-8CD0-F

Name:
1944212.pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
-

Locators

Locator:
http://www.biomedcentral.com/content/pdf/1471-2105-14-248.pdf (Publisher version; Open Access status unknown)

Description:
-

OA-Status:

Creators

Creators:
Hauser, M.¹, Author
Mayer, C. E.¹, Author
Söding, J.², Author

Affiliations:
¹ external, ou_persistent22
² Research Group of Computational Biology, MPI for Biophysical Chemistry, Max Planck Society, ou_1933286

Content

Free keywords: -

Abstract:

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
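The abstract's idea of an alignment-free k-mer prefilter can be illustrated with a minimal Python sketch. This is not kClust's implementation. As a stated simplification, it counts only exact shared 6-mers (each scoring 1), whereas the abstract describes a cumulative score over similar 6-mers. The function names (`kmers`, `build_index`, `prefilter`), the `min_shared` threshold, and the toy sequences are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(seqs, k):
    """Map each k-mer to the ids of the sequences that contain it."""
    index = defaultdict(set)
    for seq_id, seq in seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def prefilter(query_id, seqs, index, k, min_shared):
    """Return {other_id: shared k-mer count} for candidates passing the threshold.

    Simplification (assumption): only exact k-mer matches are counted,
    each contributing a score of 1; no substitution scores are used.
    """
    counts = defaultdict(int)
    for kmer in kmers(seqs[query_id], k):
        # Look up only sequences that share this k-mer, so unrelated
        # sequences are never compared with the query at all.
        for other_id in index.get(kmer, ()):
            if other_id != query_id:
                counts[other_id] += 1
    return {sid: n for sid, n in counts.items() if n >= min_shared}


# Toy data (hypothetical sequences, not taken from the paper).
seqs = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKAHFSRQ",  # one substitution relative to "a"
    "c": "GSHMLEDPVDAFKLNGGW",      # shares no 6-mer with "a"
}
index = build_index(seqs, k=6)
hits = prefilter("a", seqs, index, k=6, min_shared=3)
print(hits)
```

Because candidates are found through the k-mer index, only sequence pairs that share at least one k-mer are ever scored, which is the general reason such prefilters avoid aligning every pair. Only the pairs that pass the threshold would then go on to a more expensive alignment step.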

Details

Language(s): eng - English

Dates: Published Online: 2013; Date issued: 2013-08-15

Publication Status: Issued

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: Peer

Identifiers: DOI: 10.1186/1471-2105-14-248

Degree: -

Source 1

Title: BMC Bioinformatics

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: BioMed Central

Pages: -
Volume / Issue: 14
Sequence Number: 248
Start / End Page: -
Identifier: ISSN: 1471-2105
CoNE: https://pure.mpg.de/cone/journals/resource/111000136905000