  Explainable deep learning models for biological sequence classification

Budach, S. (2019). Explainable deep learning models for biological sequence classification. PhD Thesis. doi:10.17169/refubium-29866.

Creators:
Budach, Stefan (1, 2), Author
Marsico, Annalisa (3), Referee
Affiliations:
(1) IMPRS for Biology and Computation (Anne-Dominique Gindrat), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_1479666
(2) Fachbereich Mathematik und Informatik der Freien Universität Berlin, ou_persistent22
(3) RNA Bioinformatics (Annalisa Marsico), Independent Junior Research Groups (OWL), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_2117285

Content

Free keywords: deep learning; interpretability; bioinformatics
 Abstract: Biological sequences (DNA, RNA and proteins) orchestrate the behavior of all living cells, and trying to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the use of machine learning in our field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs), have been shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex, which makes model application and interpretation difficult, yet the ability to interpret which features a model has learned from the data is crucial for understanding and explaining new biological mechanisms.

This work therefore presents pysster, our open-source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We implement and evaluate different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond the pure sequence to improve the biological interpretability of the learned features. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by the models are then visualized as sequence and structure motifs, together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs, we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates.
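
The sketch below illustrates the general technique this paragraph describes: a 1D CNN classifying one-hot encoded sequences, with RNA secondary structure supplied as additional input channels. It is a minimal, generic illustration in Keras, not the pysster API; the toy sequences, labels, layer sizes and helper names are assumptions made only for this example.

```python
# Generic sketch (NOT the pysster API): 1D CNN on one-hot encoded RNA
# sequence plus dot-bracket secondary structure channels. All data and
# hyperparameters below are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

SEQ_ALPHABET = "ACGU"     # RNA nucleotides
STRUCT_ALPHABET = "()."   # dot-bracket structure annotation

def one_hot(string, alphabet):
    """One-hot encode a string over the given alphabet (length x channels)."""
    mat = np.zeros((len(string), len(alphabet)), dtype=np.float32)
    for i, ch in enumerate(string):
        mat[i, alphabet.index(ch)] = 1.0
    return mat

def encode(seq, struct):
    """Concatenate sequence and structure channels into one input matrix."""
    return np.concatenate([one_hot(seq, SEQ_ALPHABET),
                           one_hot(struct, STRUCT_ALPHABET)], axis=1)

# Toy example: two length-12 binding-site candidates (hypothetical data/labels).
x = np.stack([encode("ACGUACGUACGU", "((((....))))"),
              encode("UUUUACGUAAAA", "............")])
y = np.array([1, 0])  # 1 = bound, 0 = unbound

# Small convolutional network: the first-layer filters act as motif detectors
# whose maximal activations can later be summarized as sequence/structure logos.
model = models.Sequential([
    layers.Input(shape=(12, len(SEQ_ALPHABET) + len(STRUCT_ALPHABET))),
    layers.Conv1D(filters=16, kernel_size=6, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=2, verbose=0)
print(model.predict(x))
```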

Finally, we present a larger biological application by predicting RNA binding of proteins for transcripts for which experimental protein-RNA interaction data are not yet available. Here, the comprehensive interpretation options of CNNs made us aware of a potential technical bias in the experimental eCLIP (enhanced crosslinking and immunoprecipitation) data that were used as a basis for the models. This allowed for subsequent tuning of the models and data to obtain more meaningful predictions in practice.

Details

Language(s):
 Dates: 2019-12, 2021-03-29
 Publication Status: Published online
 Pages: vii, 119 pages
 Publishing info: -
 Table of Contents: -
 Rev. Type: -
 Degree: PhD
