Exploiting protein language model sequence representations for repeat detection

Qiu, K; Dunin-Horkawicz, S; Lupas, AN

doi:10.1101/2024.06.07.596093

Qiu, K., Dunin-Horkawicz, S., & Lupas, A. (submitted). Exploiting protein language model sequence representations for repeat detection.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-000F-640D-3
Version Permalink: https://hdl.handle.net/21.11116/0000-000F-B96A-A

Genre: Preprint

Creators

Creators:
Qiu, K¹, Author
Dunin-Horkawicz, S¹,², Author
Lupas, AN¹, Author

Affiliations:
¹ Department Protein Evolution, Max Planck Institute for Biology Tübingen, Max Planck Society, ou_3371683
² Structural Bioinformatics Group, Department Protein Evolution, Max Planck Institute for Biology Tübingen, Max Planck Society, ou_3606657

Content

Free keywords: -

Abstract: Duplication is an essential evolutionary mechanism that operates at the scale of chromosomes, large DNA segments, genes, protein domains, and shorter motifs. The study of duplication is central to understanding protein evolution, but detecting repetitive sequence patterns is often challenging because the similarity between internal repeats decreases with long-term divergence. The most sensitive sequence-based repeat detection method, HHrepID, relies on the construction of multiple sequence alignments (MSAs) to enhance homology signals and thus facilitate the detection of very ancient duplications. However, this alignment-based approach is slow, which limits its use in large-scale scans. Recent advances in protein representation learning have introduced sequence embeddings extracted from protein language models as a powerful and much faster alternative to MSAs. Protein sequence representations have been shown to be effective in homology detection, as exemplified by software such as our recently developed pLM-BLAST. In this study, we implement pLM-Repeat, a pipeline built upon pLM-BLAST, to identify repeats encoded in sequence embeddings. pLM-Repeat achieves sensitivity comparable to that of HHrepID in detecting the presence of repeats, while predicting many more repeat units and running significantly faster. We further trained a neural network, DeepRepeat, to detect domains whose patterns resemble those of well-characterized repeat folds, in order to support fast filtering. Using our newly developed tools, we scanned the AFDB90v4 database and identified a collection of novel, previously undescribed repeat proteins.
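
The general idea of reading repeats out of sequence embeddings can be illustrated with a small sketch. This is not the pLM-Repeat pipeline or the pLM-BLAST API; it only assumes that a protein language model yields one embedding vector per residue, and it uses a synthetic array in place of real embeddings. The function names `self_similarity` and `offset_profile` are hypothetical. Comparing a sequence's embeddings against themselves gives a similarity matrix in which repeats tend to show up as high-scoring off-diagonals, and the offset of such a diagonal hints at the repeat period.

```python
import numpy as np


def self_similarity(emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of per-residue embeddings (L x L)."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def offset_profile(sim: np.ndarray, min_offset: int = 1) -> dict:
    """Mean similarity along each off-diagonal k >= min_offset.

    A high mean at offset k suggests that residue i resembles residue i + k,
    i.e. a candidate repeat period of k (or a divisor of k).
    """
    length = sim.shape[0]
    return {
        k: float(np.diagonal(sim, offset=k).mean())
        for k in range(min_offset, length)
    }


# Synthetic stand-in for real embeddings (assumption): a 10-residue unit
# with 64-dimensional vectors, repeated four times without divergence.
rng = np.random.default_rng(0)
repeat_unit = rng.normal(size=(10, 64))
emb = np.tile(repeat_unit, (4, 1))

profile = offset_profile(self_similarity(emb), min_offset=3)
```

In this toy case the off-diagonals at multiples of 10 score far above the others. Real embeddings of diverged repeats would give a much weaker signal, which is why a dedicated alignment and scoring step, as in the pipeline described in the abstract, is needed in practice.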

Details

Language(s): -

Dates: Submitted: 2024-06

Publication Status: Submitted

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: -

Identifiers: DOI: 10.1101/2024.06.07.596093

Degree: -
