
Released

Preprint

Endowing protein language models with structural knowledge

MPS-Authors

Chen, Dexiong
Borgwardt, Karsten / Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Max Planck Society

Hartout, Philip
Borgwardt, Karsten / Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Max Planck Society

Pellizzoni, Paolo
Borgwardt, Karsten / Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Max Planck Society

Oliver, Carlos
Borgwardt, Karsten / Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Max Planck Society

Borgwardt, Karsten
Borgwardt, Karsten / Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Max Planck Society

Citation

Chen, D., Hartout, P., Pellizzoni, P., Oliver, C., & Borgwardt, K. (2024). Endowing protein language models with structural knowledge. arXiv preprint. doi:10.48550/arXiv.2401.14819.


Cite as: https://hdl.handle.net/21.11116/0000-000F-6452-4
Abstract
Understanding the relationships between protein sequence, structure and
function is a long-standing biological challenge with manifold implications
from drug design to our understanding of evolution. Recently, protein language
models have emerged as the preferred method for this challenge, thanks to their
ability to harness large sequence databases. Yet, their reliance on expansive
sequence data and parameter sets limits their flexibility and practicality in
real-world scenarios. Concurrently, the recent surge in computationally
predicted protein structures unlocks new opportunities in protein
representation learning. While promising, the computational burden carried by
such complex data still hinders widespread practical adoption. To
address these limitations, we introduce a novel framework that enhances protein
language models by integrating protein structural data. Drawing from recent
advances in graph transformers, our approach refines the self-attention
mechanisms of pretrained language transformers by integrating structural
information with structure extractor modules. This refined model, termed
Protein Structure Transformer (PST), is further pretrained on a small protein
structure database, using the same masked language modeling objective as
traditional protein language models. Empirical evaluations of PST demonstrate
its superior parameter efficiency relative to protein language models, despite
being pretrained on a dataset comprising only 542K structures. Notably, PST
consistently outperforms the state-of-the-art foundation model for protein
sequences, ESM-2, setting a new benchmark in protein function prediction. Our
findings underscore the potential of integrating structural information into
protein language models, paving the way for more effective and efficient
protein modeling. Code and pretrained models are available at
https://github.com/BorgwardtLab/PST.
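
The following is a minimal, illustrative sketch (in PyTorch) of the idea summarized in the abstract: a small structure extractor over a residue contact graph refines the token representations that feed a transformer self-attention layer, which could then be further pretrained with a masked language modeling objective. This is not the authors' implementation; all class names, dimensions, and the toy contact graph are assumptions made for illustration. The actual code and pretrained models are available at https://github.com/BorgwardtLab/PST.

# Minimal sketch (assumed names and shapes) of structure-refined self-attention.
import torch
import torch.nn as nn


class StructureExtractor(nn.Module):
    """Toy message-passing block over a residue contact graph (adjacency matrix)."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_residues, dim); adj: (batch, n_residues, n_residues)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)   # node degrees for normalization
        return torch.relu(self.linear(adj @ x / deg))    # mean aggregation over contacts


class StructureAwareAttention(nn.Module):
    """Self-attention whose inputs (queries, keys, values) are refined with structural features."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.struct = StructureExtractor(dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add structure-derived features to the residue embeddings before attention;
        # a pretrained protein language model layer would be refined in this spirit.
        h = h + self.struct(h, adj)
        out, _ = self.attn(h, h, h)
        return out


if __name__ == "__main__":
    batch, n_res, dim = 2, 50, 64
    h = torch.randn(batch, n_res, dim)                       # residue embeddings from a protein LM
    adj = (torch.rand(batch, n_res, n_res) < 0.1).float()    # toy random contact graph
    layer = StructureAwareAttention(dim)
    print(layer(h, adj).shape)                               # torch.Size([2, 50, 64])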