Using machine learning to predict and better understand DNA methylation and genomic enhancers

Huska, Matthew R.

doi:10.17169/refubium-8597

Thesis

MPS-Authors

Huska, Matthew R.
IMPRS for Computational Biology and Scientific Computing - IMPRS-CBSC (Kirsten Kelleher), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society;
Fachbereich Mathematik und Informatik der Freien Universität Berlin;

Citation

Huska, M. R. (2017). Using machine learning to predict and better understand DNA methylation and genomic enhancers. PhD Thesis, Freie Universität, Berlin. doi:10.17169/refubium-8597.

Cite as: https://hdl.handle.net/21.11116/0000-0000-82D7-A

Abstract

In this thesis we explore the influence of DNA sequence on genomic elements that are involved in the regulation of genes. We approach this topic using tools from the field of machine learning, which provides established methods for identifying patterns in sequential data, in this case sequences of nucleotides.

First, we show that the location and tissue-specificity of experimentally determined non-methylated regions of the genome can be predicted with high accuracy using the regions' DNA sequence. This analysis relies on new experimental methods that have been used to measure DNA methylation genome-wide, and their development has led to a shift away from relying on CpG islands as a proxy for non-methylated genomic regions. We demonstrate the high predictive performance of our method in two tissues across six different vertebrate species, as well as in ten human tissues, and show that the method we use outperforms existing methods that were designed to identify CpG islands.

Next, we present a new approach to computationally predicting genomic enhancers. In contrast to existing methods, we combine the results of multiple complementary experimental methods to define the set of enhancers from which to learn patterns, and we use a machine learning method called co-training to enable us to incorporate this small set of high confidence enhancer regions as well as the rest of the genome into the training of our predictor. The enhancers are predicted based on both experimental data from ChIP-seq experiments and the DNA sequence of each region. We are able to show that our method achieves better predictive performance than other methods, and that co-training is particularly well suited for this problem because it is able to reduce the problem of overfitting.
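
To illustrate the general idea of co-training mentioned in the abstract, the following is a minimal sketch of a generic two-view co-training loop. It is not the thesis's implementation: the nearest-centroid classifier, the function names (`make_toy_data`, `fit`, `predict`, `co_train`), the parameters `rounds` and `per_round`, and the synthetic two-dimensional features are all assumptions made for illustration. The two feature lists merely stand in for a ChIP-seq-derived view and a sequence-derived view of each region.

```python
def make_toy_data(n=40):
    """Build synthetic, well-separated two-view data (purely illustrative)."""
    chip_view, seq_view, truth = [], [], []
    for i in range(n):
        label = i % 2
        center = 3.0 if label == 1 else -3.0
        # Deterministic offsets in [-1, 1] stand in for measurement noise.
        off1 = ((i * 7) % 5 - 2) / 2.0
        off2 = ((i * 3) % 5 - 2) / 2.0
        chip_view.append([center + off1, center - off2])
        seq_view.append([center + off2, center + off1])
        truth.append(label)
    return chip_view, seq_view, truth


def fit(rows, labels):
    """Nearest-centroid model: one mean vector per class (0 and 1)."""
    model = {}
    for cls in (0, 1):
        members = [r for r, lab in zip(rows, labels) if lab == cls]
        model[cls] = [sum(col) / len(members) for col in zip(*members)]
    return model


def predict(model, row):
    """Return (predicted label, confidence margin) for one example."""
    d0 = sum((a - b) ** 2 for a, b in zip(row, model[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(row, model[1]))
    return (1 if d1 < d0 else 0), abs(d0 - d1)


def co_train(view_a, view_b, seed_labels, rounds=20, per_round=2):
    """Grow a small labeled seed set using two views of the same examples.

    seed_labels maps example index -> 0/1 and must contain both classes.
    """
    labels = dict(seed_labels)
    for _ in range(rounds):
        unlabeled = [i for i in range(len(view_a)) if i not in labels]
        if not unlabeled:
            break
        known = sorted(labels)
        new_labels = {}
        for view in (view_a, view_b):
            # Train one classifier per view on the current labeled pool.
            model = fit([view[i] for i in known], [labels[i] for i in known])
            preds = {i: predict(model, view[i]) for i in unlabeled}
            # Each view contributes the examples it is most confident about.
            ranked = sorted(unlabeled, key=lambda i: preds[i][1], reverse=True)
            for i in ranked[:per_round]:
                # If both views pick the same example, the first view wins.
                new_labels.setdefault(i, preds[i][0])
        labels.update(new_labels)
    return labels


chip_view, seq_view, truth = make_toy_data()
seed = {i: truth[i] for i in range(4)}  # small high-confidence seed set
pseudo_labels = co_train(chip_view, seq_view, seed)
print(len(pseudo_labels), "examples labeled")
```

In this sketch each view trains its own classifier on the shared labeled pool and adds its most confident predictions back to that pool, so a small seed set gradually extends to the unlabeled examples. Real genomic data would require a proper classifier, held-out evaluation, and a stopping criterion.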