Using machine learning to predict and better understand DNA methylation and genomic enhancers


Huska, Matthew R.
IMPRS for Computational Biology and Scientific Computing - IMPRS-CBSC (Kirsten Kelleher), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society;
Fachbereich Mathematik und Informatik der Freien Universität Berlin;


Huska, M. R. (2017). Using machine learning to predict and better understand DNA methylation and genomic enhancers. PhD Thesis, Freie Universität, Berlin. doi:10.17169/refubium-8597.

Cite as: https://hdl.handle.net/21.11116/0000-0000-82D7-A
In this thesis we explore the influence of DNA sequence on genomic elements that are involved in the regulation of genes. We approach this topic using tools from the field of machine learning, which provides established methods for identifying patterns in sequential data, in this case sequences of nucleotides.

First, we show that the location and tissue-specificity of experimentally determined non-methylated regions of the genome can be predicted with high accuracy using the regions' DNA sequence. This analysis relies on new experimental methods that have been used to measure DNA methylation genome-wide, and their development has led to a shift away from relying on CpG islands as a proxy for non-methylated genomic regions. We demonstrate the high predictive performance of our method in two tissues across six different vertebrate species, as well as in ten human tissues, and show that the method we use outperforms existing methods that were designed to identify CpG islands.

Next, we present a new approach to computationally predicting genomic enhancers. In contrast to existing methods, we combine the results of multiple complementary experimental methods to define the set of enhancers from which to learn patterns, and we use a machine learning method called co-training to enable us to incorporate this small set of high confidence enhancer regions as well as the rest of the genome into the training of our predictor. The enhancers are predicted based on both experimental data from ChIP-seq experiments and the DNA sequence of each region. We are able to show that our method achieves better predictive performance than other methods, and that co-training is particularly well suited for this problem because it is able to reduce the problem of overfitting.
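The co-training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the implementation used in the thesis: the nearest-centroid classifier is a deliberately simple stand-in, the data are synthetic, and every name (`co_train`, `chip_view`, `seq_view`, `seed`) is hypothetical. The sketch assumes each genomic region is described by two feature views, one standing in for ChIP-seq signal and one for DNA sequence features, and that only a small seed set of regions carries a label. In each round, a classifier is trained per view and its most confident predictions on unlabeled regions are added to the shared labeled pool.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


class NearestCentroid:
    """Stand-in classifier (an assumption, not the thesis's model)."""

    def fit(self, X, y):
        # Both classes must be present in y, so the seed set needs
        # at least one positive and one negative example.
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def score(self, x):
        # Positive when x is closer to the positive centroid;
        # the magnitude serves as a crude confidence.
        return sq_dist(x, self.neg) - sq_dist(x, self.pos)

    def predict(self, x):
        return 1 if self.score(x) > 0 else 0


def fit_views(view_a, view_b, labeled):
    """Train one classifier per view on the currently labeled regions."""
    idx = sorted(labeled)
    y = [labeled[i] for i in idx]
    clf_a = NearestCentroid().fit([view_a[i] for i in idx], y)
    clf_b = NearestCentroid().fit([view_b[i] for i in idx], y)
    return clf_a, clf_b


def co_train(view_a, view_b, seed_labels, rounds=5, per_round=2):
    """Grow the labeled pool using each view's most confident predictions."""
    labeled = dict(seed_labels)
    for _ in range(rounds):
        clf_a, clf_b = fit_views(view_a, view_b, labeled)
        unlabeled = [i for i in range(len(view_a)) if i not in labeled]
        for clf, view in ((clf_a, view_a), (clf_b, view_b)):
            ranked = sorted(unlabeled,
                            key=lambda i: abs(clf.score(view[i])),
                            reverse=True)
            for i in ranked[:per_round]:
                # Keep the first pseudo-label if both views pick region i.
                labeled.setdefault(i, clf.predict(view[i]))
    return fit_views(view_a, view_b, labeled) + (labeled,)


# Synthetic toy data (assumption): 20 positive and 20 negative regions,
# two features per view, with classes placed far apart.
rng = random.Random(0)


def jitter(center):
    return [center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)]


truth = [1] * 20 + [0] * 20
chip_view = [jitter(5.0 if t else 0.0) for t in truth]  # stands in for ChIP-seq
seq_view = [jitter(5.0 if t else 0.0) for t in truth]   # stands in for sequence
seed = {0: 1, 1: 1, 20: 0, 21: 0}  # small high-confidence labeled set

clf_chip, clf_seq, labeled = co_train(chip_view, seq_view, seed)
predictions = [1 if clf_chip.score(a) + clf_seq.score(b) > 0 else 0
               for a, b in zip(chip_view, seq_view)]
accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(len(labeled), accuracy)
```

The toy classes are separated so widely that the example is trivially solvable; it shows only the mechanics of sharing pseudo-labels between two views, not the predictive performance reported in the thesis.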