Abstract:
Elucidating the mechanisms of transcriptional regulation relies heavily on the sequence annotation
of the binding sites of DNA-binding proteins called transcription factors. With
the rationale that binding sites conserved across different species are more likely to be functional,
the standard approach is to employ cross-species comparisons and focus the search
on conserved regions. Usually, computational methods that annotate conserved binding
sites perform the alignment and binding site annotation steps separately and combine the
results in the end. If the binding site descriptions are weak or the sequence similarity is
low, the local gap structure of the alignment poses a problem in detecting the conserved
sites. In this thesis, we introduce a novel method that integrates the two axes of sequence
conservation and binding site annotation in a simultaneous approach yielding annotated
alignments – pairwise alignments with parts annotated as putative conserved transcription
factor binding sites.
Standard pairwise alignments are extended to include additional states for binding site
profiles. A statistical framework that estimates profile-related parameters based on desired
type I and type II errors is prescribed. This forms the core of the tool SimAnn. As an
extension, we use existing probabilistic models to demonstrate how the framework can be
adapted to consider position-specific evolutionary characteristics of binding sites during
parameter estimation. This underlies the tool eSimAnn.
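As a rough illustration of how a profile score threshold can be tied to desired type I and type II errors, the following minimal sketch enumerates all words of the profile width, computes their log-odds scores, and picks the lowest threshold whose false-positive probability under a uniform background stays within a chosen level. This is a generic textbook-style procedure and not necessarily the estimation scheme used in SimAnn; the function name `threshold_for_type1`, the example matrix `GATA_PWM`, and the uniform background are assumptions made for illustration.

```python
import itertools
import math

# Hypothetical 4-column profile (position weight matrix) favouring "GATA".
# All probabilities are non-zero so that log-odds scores are finite.
GATA_PWM = [
    {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}
    for consensus in "GATA"
]

def threshold_for_type1(pwm, alpha, background=0.25):
    """Return (threshold, type_I_error, type_II_error) for a log-odds score cut-off.

    Words scoring >= threshold are called sites. The type I error is the
    probability that a background word passes; the type II error is the
    probability that a word drawn from the profile fails.
    Exhaustive enumeration, so only practical for short profiles.
    """
    w = len(pwm)
    words = []
    for word in itertools.product("ACGT", repeat=w):
        score = sum(math.log(col[b] / background) for col, b in zip(pwm, word))
        p_background = background ** w
        p_site = math.prod(col[b] for col, b in zip(pwm, word))
        words.append((score, p_background, p_site))
    words.sort(key=lambda t: t[0], reverse=True)

    threshold, type1, power = float("inf"), 0.0, 0.0
    idx = 0
    while idx < len(words):
        # Group words that share the same score (up to rounding error).
        end = idx
        while end < len(words) and abs(words[end][0] - words[idx][0]) < 1e-9:
            end += 1
        background_mass = sum(t[1] for t in words[idx:end])
        if type1 + background_mass > alpha:
            break  # admitting this score level would exceed the allowed type I error
        type1 += background_mass
        power += sum(t[2] for t in words[idx:end])
        threshold = words[idx][0]
        idx = end
    return threshold, type1, 1.0 - power

threshold, type1, type2 = threshold_for_type1(GATA_PWM, alpha=0.01)
```

With a weak profile, tightening the type I error sharply raises the type II error, which is the trade-off such a framework has to balance.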
Through simulations and real data analysis, we study how a simultaneous approach, as
opposed to a multi-step one, affects the resulting predictions. The former enables
a local rearrangement in the alignment structure to bring forth perfectly aligned binding
sites. This precludes the necessity of adopting post-processing steps to handle errors in
pre-computed alignments, as is usually done in multi-step approaches. Additionally, the
framework for parameter estimation is applicable to any novel profile of interest. The simultaneous
approach stands out especially for instances with poor sequence conservation or poor profile quality.
As a by-product of the analysis, we also model the annotated alignment problem
as an extended pair Hidden Markov Model and illustrate the correspondence between
the various theoretical concepts.
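The idea of an alignment extended with a profile state can be sketched as a score-based dynamic program, analogous to a Viterbi recursion over a pair HMM with one extra state. The sketch below is a simplified illustration under assumed settings, not the SimAnn implementation: it uses global alignment with a linear gap penalty and a single profile, and the names `annotated_alignment`, `site_penalty`, `GATA_PWM`, and all scoring values are hypothetical.

```python
import math

# Hypothetical 4-column profile favouring "GATA" (non-zero entries throughout).
GATA_PWM = [
    {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}
    for consensus in "GATA"
]

def pwm_score(pwm, site, background=0.25):
    """Log-odds score of `site` under the profile versus a uniform background."""
    return sum(math.log(col[b] / background) for col, b in zip(pwm, site))

def annotated_alignment(x, y, pwm, match=1.0, mismatch=-1.0, gap=-2.0,
                        site_penalty=2.0):
    """Global alignment of x and y with an additional profile move.

    Moves: M (align two bases), X (gap in y), Y (gap in x), and P, which
    aligns a gap-free block of len(pwm) bases in both sequences and scores
    it with the profile instead of the match/mismatch scheme.
    Returns (best_score, sites), where sites lists the 0-based start
    positions (in x, in y) of blocks annotated as conserved binding sites.
    """
    w, n, m = len(pwm), len(x), len(y)
    score = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append((score[i - 1][j - 1] + s, "M"))
            if i > 0:
                candidates.append((score[i - 1][j] + gap, "X"))
            if j > 0:
                candidates.append((score[i][j - 1] + gap, "Y"))
            if i >= w and j >= w:
                s = (pwm_score(pwm, x[i - w:i]) + pwm_score(pwm, y[j - w:j])
                     - site_penalty)
                candidates.append((score[i - w][j - w] + s, "P"))
            score[i][j], back[i][j] = max(candidates)

    # Trace back from the end to collect the annotated sites.
    sites, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "M":
            i, j = i - 1, j - 1
        elif move == "X":
            i -= 1
        elif move == "Y":
            j -= 1
        else:  # "P": a profile block ending at (i, j)
            i, j = i - w, j - w
            sites.append((i, j))
    sites.reverse()
    return score[n][m], sites

# One base is deleted upstream of the site in the second sequence; the gap is
# placed so that both site occurrences end up in the same gap-free block.
best_score, sites = annotated_alignment("CCGATACC", "CGATACC", GATA_PWM)
```

Because the profile move competes with the ordinary alignment moves inside one optimisation, gaps are placed where they leave the site gap-free, which is the kind of local rearrangement that a fixed, pre-computed alignment cannot provide.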