Assessing structure and disorder prediction tools for de novo emerged proteins 
in the age of machine learning

Aubel, M; Eicholt, L; Bornberg-Bauer, E

doi:10.12688/f1000research.130443.1


Aubel, M., Eicholt, L., & Bornberg-Bauer, E. (2023). Assessing structure and disorder prediction tools for de novo emerged proteins in the age of machine learning. F1000Research. doi:10.12688/f1000research.130443.1.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-000C-E06E-C

Version Permalink: https://hdl.handle.net/21.11116/0000-000C-E06F-B

Genre: Journal Article

Creators

Creators:
Aubel, M, Author
Eicholt, L, Author
Bornberg-Bauer, E¹, Author

Affiliations:
¹ Department Protein Evolution, Max Planck Institute for Developmental Biology, Max Planck Society, ou_3375791

Content

Free keywords: -

Abstract: Background: De novo protein-coding genes emerge from scratch in the non-coding regions of the genome and have, by definition, no homology to other genes. Their encoded de novo proteins therefore belong to the so-called "dark protein space". So far, only four de novo protein structures have been experimentally approximated. Low homology, presumed high disorder, and the small number of available structures result in low-confidence structural predictions for de novo proteins in most cases. Here, we look at the most widely used structure and disorder predictors and assess their applicability to de novo emerged proteins. Since AlphaFold2 relies on the generation of multiple sequence alignments and was trained on solved structures of largely conserved and globular proteins, its performance on de novo proteins remains unknown. More recently, natural language models of proteins have been used for alignment-free structure prediction, potentially making them more suitable for de novo proteins than AlphaFold2.
Methods: We applied different disorder predictors (IUPred3 short/long, flDPnn) and structure predictors, AlphaFold2 on the one hand and language-based models (OmegaFold, ESMFold, RGN2) on the other, to four de novo proteins with experimental evidence on structure. We compared the resulting predictions between the different predictors as well as with the existing experimental evidence.
Results: Results from IUPred, the most widely used disorder predictor, depend heavily on the choice of parameters and differ significantly from those of flDPnn, which was recently found to outperform most other predictors in a comparative assessment study. Similarly, different structure predictors yielded varying results and confidence scores for de novo proteins.
Conclusions: We suggest that, while protein language model-based approaches might in some cases be more accurate than AlphaFold2, structure prediction for de novo emerged proteins remains a difficult task for any predictor, be it of disorder or of structure.
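
The comparison between disorder predictors described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the 0.5 cutoff for calling a residue disordered, and the score values are all assumptions made for illustration, and no real predictor output is used. It only shows how two per-residue score tracks for the same sequence could be reduced to a disorder fraction and a per-residue agreement.

```python
def disorder_fraction(scores, threshold=0.5):
    """Fraction of residues with a score at or above the threshold.

    The 0.5 cutoff is an assumption for illustration, not a value
    taken from the study.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)


def per_residue_agreement(scores_a, scores_b, threshold=0.5):
    """Fraction of positions where two predictors make the same
    ordered/disordered call at the given threshold."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score tracks must cover the same residues")
    if not scores_a:
        raise ValueError("scores must not be empty")
    calls_a = [s >= threshold for s in scores_a]
    calls_b = [s >= threshold for s in scores_b]
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)


# Hypothetical per-residue scores for one six-residue stretch;
# these numbers are invented, not predictor output.
predictor_a = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
predictor_b = [0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

print(disorder_fraction(predictor_a))
print(disorder_fraction(predictor_b))
print(per_residue_agreement(predictor_a, predictor_b))
```

Reporting agreement alongside the two disorder fractions helps because two predictors can report a similar overall disorder content while disagreeing on which residues are disordered.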

Details

Language(s):

Dates: Published Online: 2023-03

Publication Status: Published online

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: -

Identifiers: DOI: 10.12688/f1000research.130443.1

Degree: -

Source 1

Title: F1000Research

Abbreviation: F1000Research

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: London : BioMed Central

Pages: -

Volume / Issue: -

Sequence Number: -

Start / End Page: -

Identifier: ISSN: 2046-1402
CoNE: https://pure.mpg.de/cone/journals/resource/2046-1402