  Statistical methods for motif hit enrichment in DNA sequences

Kopp, W. (2016). Statistical methods for motif hit enrichment in DNA sequences. PhD Thesis. doi:10.17169/refubium-11714.

Creators

 Creators:
Kopp, Wolfgang (1, 2), Author
Vingron, Martin (3), Referee
Affiliations:
(1) IMPRS for Computational Biology and Scientific Computing - IMPRS-CBSC (Kirsten Kelleher), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_1479666
(2) Department of Mathematics and Computer Science, FU Berlin, ou_persistent22
(3) Transcriptional Regulation (Martin Vingron), Dept. of Computational Molecular Biology (Head: Martin Vingron), Max Planck Institute for Molecular Genetics, Max Planck Society, ou_1479639

Content

Free keywords: transcription factor binding sites; motif match statistics; self-overlapping match statistics; number of motif matches
 Abstract: In this thesis, we discuss methods for analyzing the non-coding sequence of the genome (e.g., promoters) with respect to the identification and enrichment of transcription factor binding sites (TFBSs), as they are related to gene regulation. The identification of putative TFBSs is based on the log-likelihood ratio between a TF motif, which describes the binding affinity of a TF towards the DNA, and a background model, which is implemented by an order-d Markov model with d ≥ 0, in conjunction with a pre-defined log-likelihood ratio threshold. Chapter 2 reviews algorithms for computing the false positive probability of calling motif hits for a given threshold. As putative TFBSs can self-overlap, which affects the enrichment test of the number of TFBSs, we discuss the quantification of overlapping TFBS predictions in Chapter 3. In Chapter 4, we discuss a compound Poisson model for the distribution of the number of TFBSs in both strands of the DNA sequence, which represents an extension of Pape et al. [36]. The main advances of our model are the use of newly derived principal overlapping hit probabilities, which are motivated by the discussion of principal periods in Reinert et al. [41], and support for higher-order Markov models for the background. In Chapter 5, we discuss a novel Markov model that is used to determine the probability of a TFBS occurrence that does not overlap a previous TFBS occurrence and thus marks the beginning of a clump; we term this the clump start probability. The clump start probability then serves as an important building block for Chapter 6, in which we present a novel combinatorial model for the distribution of the number of motif hits. To that end, we efficiently sum up the probabilities of all realizations of placing x TFBSs in a sequence of finite length N.
We systematically compared the accuracy of the combinatorial model, the compound Poisson model, and the binomial model. An implementation of the algorithms that were discussed in this thesis is provided as an R package that is available at https://github.com/wkopp/mdist.
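
The hit-calling step described in the abstract (a log-likelihood ratio between a motif and a background model, compared against a threshold) can be illustrated with a minimal sketch. It is not taken from the thesis or from the mdist package. It assumes an order-0 background (the simplest case, d = 0), scans the forward strand only, and uses a made-up two-position motif. The function names `llr_scores` and `motif_hits` are hypothetical.

```python
import math


def llr_scores(seq, pwm, background):
    """Log-likelihood ratio of motif vs. order-0 background at each start position.

    pwm: list of dicts, one per motif position, mapping base -> probability.
    background: dict mapping base -> probability (order-0 assumption).
    """
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(
            math.log(pwm[j][base] / background[base])
            for j, base in enumerate(window)
        )
        scores.append(score)
    return scores


def motif_hits(seq, pwm, background, threshold):
    """Start positions whose score reaches the pre-defined threshold."""
    scores = llr_scores(seq, pwm, background)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy motif that prefers "AA"; all values are illustrative, not from the thesis.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = motif_hits("CAAAG", toy_pwm, uniform_bg, threshold=2.0)
print(hits)
```

In this toy run the two reported hits start at adjacent positions and therefore overlap, which is the self-overlap situation that the abstract says complicates enrichment tests on the number of hits.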

Details

Language(s): eng - English
 Dates: 2016; 2017-06-12
 Publication Status: Published online
 Pages: 216 pp.
 Publishing info: -
 Table of Contents: -
 Rev. Type: -
 Degree: PhD
