  Endowing protein language models with structural knowledge

Chen, D., Hartout, P., Pellizzoni, P., Oliver, C., & Borgwardt, K. (2024). Endowing protein language models with structural knowledge. arXiv preprint arXiv:2401.14819. doi:10.48550/arXiv.2401.14819.


Creators

Chen, Dexiong (1), Author
Hartout, Philip (1), Author
Pellizzoni, Paolo (1), Author
Oliver, Carlos (1), Author
Borgwardt, Karsten (1), Author
Affiliations:
(1) Borgwardt, Karsten / Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Max Planck Society

Content

Free keywords: Quantitative Biology, Quantitative Methods (q-bio.QM); Computer Science, Learning (cs.LG); Quantitative Biology, Biomolecules (q-bio.BM)
 Abstract: Understanding the relationships between protein sequence, structure and
function is a long-standing biological challenge with manifold implications
from drug design to our understanding of evolution. Recently, protein language
models have emerged as the preferred method for this challenge, thanks to their
ability to harness large sequence databases. Yet, their reliance on expansive
sequence data and parameter sets limits their flexibility and practicality in
real-world scenarios. Concurrently, the recent surge in computationally
predicted protein structures unlocks new opportunities in protein
representation learning. While promising, the computational burden carried by
such complex data still hinders widespread practical adoption. To
address these limitations, we introduce a novel framework that enhances protein
language models by integrating protein structural data. Drawing from recent
advances in graph transformers, our approach refines the self-attention
mechanisms of pretrained language transformers by injecting structural
information via structure extractor modules. This refined model, termed
Protein Structure Transformer (PST), is further pretrained on a small protein
structure database, using the same masked language modeling objective as
traditional protein language models. Empirical evaluations of PST demonstrate
its superior parameter efficiency relative to protein language models, despite
being pretrained on a dataset comprising only 542K structures. Notably, PST
consistently outperforms the state-of-the-art foundation model for protein
sequences, ESM-2, setting a new benchmark in protein function prediction. Our
findings underscore the potential of integrating structural information into
protein language models, paving the way for more effective and efficient
protein modeling. Code and pretrained models are available at
https://github.com/BorgwardtLab/PST.
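
The abstract describes refining the self-attention of a pretrained protein language model with structure extractor modules. The snippet below is a minimal, hypothetical PyTorch sketch of that idea: a toy extractor aggregates residue features over a structure-derived contact graph, and the result is used to refine the attention queries and keys. All class names, shapes, and the contact-graph construction are illustrative assumptions, not the authors' implementation; the actual code and pretrained models are in the repository linked above.

# Hypothetical sketch only: illustrates structure-refined self-attention,
# not the PST implementation from https://github.com/BorgwardtLab/PST.
import torch
import torch.nn as nn

class SimpleStructureExtractor(nn.Module):
    """Toy GNN-style extractor: averages neighbor features over a residue
    contact graph (adjacency derived from the 3D structure)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim) residue features; adj: (batch, length, length)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)  # avoid division by zero
        neighbor_mean = adj @ x / deg                        # mean over structural neighbors
        return self.proj(neighbor_mean)

class StructureAwareSelfAttention(nn.Module):
    """Multi-head self-attention whose queries and keys are refined with the
    structure extractor's output before attention is computed."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.extractor = SimpleStructureExtractor(dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        s = self.extractor(x, adj)     # per-residue structural embedding
        qk = x + s                     # structure-refined queries/keys
        out, _ = self.attn(qk, qk, x)  # values remain the sequence features
        return out

if __name__ == "__main__":
    batch, length, dim = 2, 16, 32
    x = torch.randn(batch, length, dim)     # residue embeddings (e.g., from a PLM)
    coords = torch.randn(batch, length, 3)  # toy 3D coordinates
    dist = torch.cdist(coords, coords)      # pairwise residue distances
    adj = (dist < 1.0).float()              # toy contact graph from the structure
    layer = StructureAwareSelfAttention(dim)
    print(layer(x, adj).shape)              # torch.Size([2, 16, 32])

In the real setting, such a refinement would sit inside each attention block of a pretrained model (the abstract names ESM-2 as the sequence baseline), and the combined model would be further pretrained with the masked language modeling objective.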

Details

Dates: 2024-01-26, 2024-01-24
Publication Status: Published online
Identifiers: DOI: 10.48550/arXiv.2401.14819


Source 1

Title: arXiv: Condensed Matter-Materials Science
Abbreviation: cond-mat.mtrl-sci
Source Genre: Journal
Identifier: arXiv:1701.06694
CoNE: https://pure.mpg.de/cone/journals/resource/arXiv:1701.06694