  Explainable deep learning models for biological sequence classification

Budach, S. (2019). Explainable deep learning models for biological sequence classification. PhD Thesis. doi:10.17169/refubium-29866.

Creators:
Budach, Stefan (1, 2), Author
Marsico, Annalisa (3), Referee
Affiliations:
(1) IMPRS for Biology and Computation (Anne-Dominique Gindrat), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_1479666
(2) Fachbereich Mathematik und Informatik der Freien Universität Berlin, ou_persistent22
(3) RNA Bioinformatics (Annalisa Marsico), Independent Junior Research Groups (OWL), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_2117285

Content

Free keywords: deep learning; interpretability; bioinformatics
 Abstract: Biological sequences (DNA, RNA and proteins) orchestrate the behavior of all living cells, and trying to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the use of machine learning in our field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs), have been shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex, which makes model application and interpretation difficult, yet the ability to interpret which features a model has learned from the data is crucial for understanding and explaining new biological mechanisms.

This work therefore presents pysster, our open-source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We implement and evaluate different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond the pure sequence to improve the biological interpretability of the learned features. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by the models are then visualized as sequence and structure motifs, together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs, we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates.
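
The sketch below illustrates the general technique this paragraph describes: a 1D CNN classifying one-hot encoded sequences, with RNA secondary structure supplied as additional input channels. It is a minimal, generic illustration in Keras, not the pysster API; the toy sequences, labels, layer sizes and helper names are assumptions made only for this example.

```python
# Generic sketch (NOT the pysster API): 1D CNN on one-hot encoded RNA
# sequence plus dot-bracket secondary structure channels. All data and
# hyperparameters below are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

SEQ_ALPHABET = "ACGU"     # RNA nucleotides
STRUCT_ALPHABET = "()."   # dot-bracket structure annotation

def one_hot(string, alphabet):
    """One-hot encode a string over the given alphabet (length x channels)."""
    mat = np.zeros((len(string), len(alphabet)), dtype=np.float32)
    for i, ch in enumerate(string):
        mat[i, alphabet.index(ch)] = 1.0
    return mat

def encode(seq, struct):
    """Concatenate sequence and structure channels into one input matrix."""
    return np.concatenate([one_hot(seq, SEQ_ALPHABET),
                           one_hot(struct, STRUCT_ALPHABET)], axis=1)

# Toy example: two length-12 binding-site candidates (hypothetical data/labels).
x = np.stack([encode("ACGUACGUACGU", "((((....))))"),
              encode("UUUUACGUAAAA", "............")])
y = np.array([1, 0])  # 1 = bound, 0 = unbound

# Small convolutional network: the first-layer filters act as motif detectors
# whose maximal activations can later be summarized as sequence/structure logos.
model = models.Sequential([
    layers.Input(shape=(12, len(SEQ_ALPHABET) + len(STRUCT_ALPHABET))),
    layers.Conv1D(filters=16, kernel_size=6, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=2, verbose=0)
print(model.predict(x))
```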

Finally, we present a larger biological application by predicting RNA binding of proteins for transcripts for which experimental protein-RNA interaction data are not yet available. Here, the comprehensive interpretation options of CNNs made us aware of a potential technical bias in the experimental eCLIP (enhanced crosslinking and immunoprecipitation) data that were used as a basis for the models. This allowed for subsequent tuning of the models and data to obtain more meaningful predictions in practice.

Details

Language(s):
 Dates: 2019-12, 2021-03-29
 Publication Status: Published online
 Pages: vii, 119 pages
 Publishing info: -
 Table of Contents: -
 Rev. Type: -
 Degree: PhD
