Journal Article

Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off

MPS-Authors

Dutheil, Julien Y.
Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society;

Fulltext (public)

Dutheil_Figuet_2015.pdf
(Publisher version), 868KB

Citation

Dutheil, J. Y., & Figuet, E. (2015). Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off. BMC Bioinformatics, 16: 190. doi:10.1186/s12859-015-0619-8.


Cite as: https://hdl.handle.net/11858/00-001M-0000-0027-AD52-0
Abstract
Background: Comparative analysis of homologous sequences enables the understanding of evolutionary patterns
at the molecular level, unraveling the functional constraints that shaped the underlying genes. Bioinformatic pipelines
for comparative sequence analysis typically include procedures for (i) alignment quality assessment and (ii) control of
sequence redundancy. An additional, underassessed step is the control of the amount and distribution of missing
data in sequence alignments. While the number of sequences available for a given gene typically increases with time,
the site-specific coverage of each alignment position remains highly variable because of differences in sequencing
and annotation quality, or simply because of biological variation. For any given alignment-based analysis, the
selection of sequences thus defines a trade-off between the species representation and the quantity of sites with
sufficient coverage to be included in the subsequent analyses.
Results: We introduce an algorithm for the optimization of sequence alignments according to the number of
sequences vs. number of sites trade-off. The algorithm uses a guide tree to compute scores for each bipartition of the
alignment, allowing the recursive selection of sequence subsets with optimal combinations of sequence and site
numbers. By applying our methods to two large data sets of several thousands of gene families, we show that
significant site-specific coverage increases can be achieved while controlling for the species representation.
Conclusions: The algorithm introduced in this work allows the control of the distribution of missing data in any
sequence alignment by removing sequences to increase the number of sites with a defined minimum coverage. We
advocate that our missing data optimization procedure is an important step that should be considered in
comparative analysis pipelines, together with alignment quality assessment and control of sampled diversity. An
open source C++ implementation is available at http://bioweb.me/physamp.
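
The trade-off described in the abstract can be illustrated with a minimal Python sketch. This is not the published algorithm, which scores bipartitions of a guide tree, and it is not the interface of the C++ implementation. It is a brute-force toy that counts the sites meeting a minimum coverage threshold and tests the effect of dropping one sequence at a time. The function names, the gap characters, and the four-sequence alignment are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Assumption: these characters mark missing data in the toy alignment.
GAP_CHARS = frozenset("-?")


def count_covered_sites(
    alignment: Dict[str, str], names: Iterable[str], min_coverage: float
) -> int:
    """Count columns where at least `min_coverage` (a fraction between
    0 and 1) of the selected sequences have a non-gap character."""
    selected = list(names)
    if not selected:
        return 0
    length = len(alignment[selected[0]])
    needed = min_coverage * len(selected)
    covered = 0
    for col in range(length):
        present = sum(
            1 for name in selected if alignment[name][col] not in GAP_CHARS
        )
        if present >= needed:
            covered += 1
    return covered


def best_single_removal(
    alignment: Dict[str, str], min_coverage: float
) -> Tuple[str, int]:
    """Brute force: try removing each sequence in turn and return the
    (name, covered_sites) pair that leaves the most covered sites."""
    results = []
    for candidate in alignment:
        kept = [name for name in alignment if name != candidate]
        sites = count_covered_sites(alignment, kept, min_coverage)
        results.append((candidate, sites))
    # max() keeps the first candidate in case of ties.
    return max(results, key=lambda pair: pair[1])


# Hypothetical toy alignment: sp4 is mostly missing data.
alignment = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTAC--",
    "sp3": "AC--ACGT",
    "sp4": "------GT",
}

all_complete = count_covered_sites(alignment, alignment, 1.0)
without_sp4 = count_covered_sites(alignment, ["sp1", "sp2", "sp3"], 1.0)
removed, sites_after_removal = best_single_removal(alignment, 1.0)
```

With this toy input, no site is complete across all four sequences, while dropping `sp4` leaves four complete sites: one sequence is traded for four usable alignment positions. Exhaustive removal like this scales poorly with the number of sequences, which is the kind of cost a guide-tree-based selection is meant to avoid.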