Free keywords:
Quantitative Biology, Quantitative Methods, q-bio.QM, Computer Science, Learning, cs.LG, Quantitative Biology, Biomolecules, q-bio.BM
Abstract:
Understanding the relationships between protein sequence, structure and
function is a long-standing biological challenge with manifold implications
from drug design to our understanding of evolution. Recently, protein language
models have emerged as the preferred method for this challenge, thanks to their
ability to harness large sequence databases. Yet, their reliance on expansive
sequence data and parameter sets limits their flexibility and practicality in
real-world scenarios. Concurrently, the recent surge in computationally
predicted protein structures unlocks new opportunities in protein
representation learning. While promising, the computational burden carried by
such complex data still hinders widespread practical application. To
address these limitations, we introduce a novel framework that enhances protein
language models by integrating protein structural data. Drawing from recent
advances in graph transformers, our approach refines the self-attention
mechanisms of pretrained language transformers by integrating structural
information with structure extractor modules. This refined model, termed
Protein Structure Transformer (PST), is further pretrained on a small protein
structure database, using the same masked language modeling objective as
traditional protein language models. Empirical evaluations of PST demonstrate
its superior parameter efficiency relative to protein language models, despite
being pretrained on a dataset comprising only 542K structures. Notably, PST
consistently outperforms the state-of-the-art foundation model for protein
sequences, ESM-2, setting a new benchmark in protein function prediction. Our
findings underscore the potential of integrating structural information into
protein language models, paving the way for more effective and efficient
protein modeling. Code and pretrained models are available at
https://github.com/BorgwardtLab/PST.
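
The masked language modeling objective mentioned above can be illustrated with a minimal sketch of the masking step alone. This is not taken from the PST code base: the function name `mask_sequence`, the `<mask>` token string, the 15% mask rate, and the example sequence are all assumptions chosen for illustration. The sketch also simplifies the usual scheme by always substituting the mask token, rather than sometimes keeping the residue or swapping in a random one.

```python
import random

MASK = "<mask>"  # hypothetical mask token, not PST's actual vocabulary entry


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues for a masked language modeling step.

    Returns (tokens, targets): tokens is the sequence with some residues
    replaced by MASK, and targets maps each masked position to the
    original residue that a model would be trained to recover.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    # Mask at least one residue, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


# Illustrative 22-residue sequence (made up for this example).
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(sequence)
```

During training, a model would receive `masked` (in PST's case together with structural information) and be scored only on the positions listed in `targets`.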