Statistical methods for motif hit enrichment in DNA sequences

Kopp, Wolfgang

doi:10.17169/refubium-11714

Thesis

MPS-Authors

Kopp, Wolfgang
IMPRS for Computational Biology and Scientific Computing - IMPRS-CBSC (Kirsten Kelleher), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society;
Department of Mathematics and Computer Science, FU Berlin;

Citation

Kopp, W. (2016). Statistical methods for motif hit enrichment in DNA sequences. PhD Thesis. doi:10.17169/refubium-11714.

Cite as: https://hdl.handle.net/21.11116/0000-0000-82B9-C

Abstract

In this thesis, we discuss methods for analyzing the non-coding sequence of the genome (e.g., promoters) with respect to the identification and enrichment of transcription factor binding sites (TFBSs), as they are related to gene regulation. The identification of putative TFBSs is based on the log-likelihood ratio between a TF motif, which describes the binding affinity of a TF towards the DNA, and a background model, which is implemented by an order-d Markov model with d ≥ 0, in conjunction with a pre-defined log-likelihood ratio threshold. Chapter 2 reviews algorithms for computing the false positive probability of calling motif hits for a given threshold. As putative TFBSs can self-overlap, which affects the enrichment test of the number of TFBSs, we discuss the quantification of overlapping TFBS predictions in Chapter 3. In Chapter 4, we discuss a compound Poisson model for the distribution of the number of TFBSs in both strands of the DNA sequence, which represents an extension of Pape et al. [36]. The main advances of our model are the use of newly derived principal overlapping hit probabilities, which are motivated by the discussion of principal periods in Reinert et al. [41], and support for higher-order Markov models for the background. In Chapter 5, we discuss a novel Markov model which is used to determine the probability of a TFBS occurrence that does not overlap a previous TFBS occurrence and therefore marks the beginning of a clump, termed the clump start probability. The resulting clump start probability then serves as an important building block for Chapter 6. Finally, in Chapter 6, we present a novel combinatorial model for the distribution of the number of motif hits. To that end, we efficiently sum up the probabilities of all realizations of placing x TFBSs in a sequence of length N. We systematically compared the accuracy of the combinatorial model, the compound Poisson model and the binomial model. An implementation of the algorithms that were discussed in this thesis is provided as an R package that is available at https://github.com/wkopp/mdist.
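
The hit-calling step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis or from the mdist package: the 3-position motif, the uniform order-0 background (d = 0), the threshold, and all function names are hypothetical choices made for illustration. The false positive probability is computed here by brute-force enumeration of all words, which is only feasible for very short motifs and is not one of the algorithms reviewed in Chapter 2.

```python
import itertools
import math

# Hypothetical 3-position motif: one nucleotide distribution per position.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Assumed order-0 background (d = 0): independent, uniform nucleotides.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def llr_score(window, motif=MOTIF, background=BACKGROUND):
    """Log-likelihood ratio of `window` under the motif vs. the background."""
    return sum(
        math.log(motif[i][base] / background[base])
        for i, base in enumerate(window)
    )


def motif_hits(sequence, threshold, motif=MOTIF, background=BACKGROUND):
    """Start positions (forward strand only) whose score reaches the threshold."""
    width = len(motif)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if llr_score(sequence[start:start + width], motif, background) >= threshold
    ]


def false_positive_probability(threshold, motif=MOTIF, background=BACKGROUND):
    """Background probability that a random window is called a hit.

    Brute force over all 4^width words; illustrative only.
    """
    total = 0.0
    for word in itertools.product("ACGT", repeat=len(motif)):
        if llr_score(word, motif, background) >= threshold:
            prob = 1.0
            for base in word:
                prob *= background[base]
            total += prob
    return total
```

With these toy parameters, `motif_hits("TTACGTTACGA", 2.0)` returns `[2, 7]`, and `false_positive_probability(2.0)` equals 1/64, because only the word ACG scores above the threshold. The sketch scans a single strand and ignores overlapping hits, both of which the thesis treats explicitly.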