Statistical methods for motif hit enrichment in DNA sequences

Kopp, Wolfgang

doi:10.17169/refubium-11714

Kopp, W. (2016). Statistical methods for motif hit enrichment in DNA sequences. PhD Thesis. doi:10.17169/refubium-11714.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-0000-82B9-C
Version Permalink: https://hdl.handle.net/21.11116/0000-000F-13E1-D

Genre: Thesis

Creators

Creators:
Kopp, Wolfgang (1, 2), Author
Vingron, Martin (3), Referee

Affiliations:
1. IMPRS for Computational Biology and Scientific Computing - IMPRS-CBSC (Kirsten Kelleher), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_1479666
2. Department of Mathematics and Computer Science, FU Berlin, ou_persistent22
3. Transcriptional Regulation (Martin Vingron), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_1479639

Content

Free keywords: transcription factor binding sites; motif match statistics; self-overlapping match statistics; number of motif matches

Abstract: In this thesis, we discuss methods for analyzing the non-coding sequence of the genome (e.g., promoters) with respect to the identification and enrichment of transcription factor binding sites (TFBSs), as they are related to gene regulation. The identification of putative TFBSs is based on the log-likelihood ratio between a TF motif, which describes the binding affinity of a TF towards the DNA, and a background model, which is implemented by an order-d Markov model with d ≥ 0, in conjunction with a pre-defined log-likelihood ratio threshold. Chapter 2 reviews algorithms for computing the false positive probability of calling motif hits for a given threshold. As putative TFBSs can overlap one another, which affects the enrichment test of the number of TFBSs, we discuss the quantification of overlapping TFBS predictions in Chapter 3. In Chapter 4, we discuss a compound Poisson model for the distribution of the number of TFBSs in both strands of the DNA sequence, which represents an extension of Pape et al. [36]. The main advances of our model are the use of newly derived principal overlapping hit probabilities, which are motivated by the discussion of principal periods in Reinert et al. [41], and support for higher-order Markov models for the background. In Chapter 5, we discuss a novel Markov model that is used to determine the probability of a TFBS occurrence that does not overlap a previous TFBS occurrence, termed the clump start probability, as such an occurrence marks the beginning of a clump. The resulting clump start probability then serves as an important building block for Chapter 6, in which we present a novel combinatorial model for the distribution of the number of motif hits. To that end, we efficiently sum up the probabilities of all realizations of placing x TFBSs in a finite-length sequence of length N.
We systematically compared the accuracy of the combinatorial model, the compound Poisson model and the binomial model. An implementation of the algorithms that were discussed in this thesis is provided as an R package that is available at https://github.com/wkopp/mdist.
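
The hit-calling step described in the abstract (a log-likelihood ratio between a motif and a background model, compared against a threshold) can be illustrated with a minimal Python sketch. It is not taken from the thesis or from the mdist package: the 3-bp motif, the uniform order-0 background (d = 0), and the function names `llr_score` and `motif_hits` are all hypothetical and chosen only for illustration, and only one strand is scanned.

```python
import math

# Hypothetical 3-bp motif as per-position base probabilities (illustrative values only).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Assumed order-0 background (d = 0): independent, uniformly distributed bases.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def llr_score(window, motif=MOTIF, background=BACKGROUND):
    """Log-likelihood ratio of a window under the motif vs. the background."""
    return sum(
        math.log(column[base] / background[base])
        for column, base in zip(motif, window)
    )


def motif_hits(sequence, threshold, motif=MOTIF, background=BACKGROUND):
    """Return the start positions whose score reaches the threshold."""
    width = len(motif)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if llr_score(sequence[start:start + width], motif, background) >= threshold
    ]


hits = motif_hits("TTACGTTACG", threshold=3.0)
```

With these assumed values, only windows that match the motif consensus at every position reach the threshold. A higher-order background (d ≥ 1) would replace the per-base background probability with a probability conditioned on the preceding d bases.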

Details

Language(s): eng - English

Dates: Accepted: 2016; Published Online: 2017-06-12

Publication Status: Published online

Pages: 216 pp.

Identifiers: DOI: 10.17169/refubium-11714
URI: https://refubium.fu-berlin.de/handle/fub188/7515

Degree: PhD
