  Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off

Dutheil, J. Y., & Figuet, E. (2015). Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off. BMC Bioinformatics, 16: 190. doi:10.1186/s12859-015-0619-8.


Files

Dutheil_Figuet_2015.pdf (Publisher version), 868KB
Name:
Dutheil_Figuet_2015.pdf
Description:
-
OA-Status:
Visibility:
Public
MIME-Type / Checksum:
application/pdf / [MD5]
Technical Metadata:
Copyright Date:
-
Copyright Info:
© 2015 Dutheil and Figuet. This is an Open Access article distributed under the terms of the Creative Commons Attribution License.

Locators

Locator:
http://www.biomedcentral.com/1471-2105/16/190 (Publisher version)
Description:
-
OA-Status:

Creators

 Creators:
Dutheil, Julien Y. [1], Author
Figuet, Emeric, Author
Affiliations:
[1] Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society, ou_1445635

Content

Free keywords: Sequence alignment; Comparative analysis prediction; Phylogeny; Sampling; Trade-off
 Abstract: Background: Comparative analysis of homologous sequences enables the understanding of evolutionary patterns
at the molecular level, unraveling the functional constraints that shaped the underlying genes. Bioinformatic pipelines
for comparative sequence analysis typically include procedures for (i) alignment quality assessment and (ii) control of
sequence redundancy. An additional, underassessed step is the control of the amount and distribution of missing
data in sequence alignments. While the number of sequences available for a given gene typically increases with time,
the site-specific coverage of each alignment position remains highly variable because of differences in sequencing
and annotation quality, or simply because of biological variation. For any given alignment-based analysis, the
selection of sequences thus defines a trade-off between the species representation and the quantity of sites with
sufficient coverage to be included in the subsequent analyses.
Results: We introduce an algorithm for the optimization of sequence alignments according to the number of
sequences vs. number of sites trade-off. The algorithm uses a guide tree to compute scores for each bipartition of the
alignment, allowing the recursive selection of sequence subsets with optimal combinations of sequence and site
numbers. By applying our methods to two large data sets of several thousands of gene families, we show that
significant site-specific coverage increases can be achieved while controlling for the species representation.
Conclusions: The algorithm introduced in this work allows the control of the distribution of missing data in any
sequence alignment by removing sequences to increase the number of sites with a defined minimum coverage. We
advocate that our missing data optimization procedure is an important step that should be considered in
comparative analysis pipelines, together with alignment quality assessment and control of sampled diversity. An
open source C++ implementation is available at http://bioweb.me/physamp.
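
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function names (`covered_sites`, `clades`, `removal_tradeoff`), the nested-tuple tree encoding, the gap characters, and the toy alignment are all assumptions made for illustration. For each clade of a guide tree (one side of a bipartition), it reports how many sequences would remain and how many sites would reach a given minimum coverage if that clade were removed. The scoring and recursive selection used in the paper are not reproduced here.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical tree encoding: a leaf is a sequence name, an inner node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]

GAP_CHARS = "-?"  # assumed markers for missing data


def covered_sites(aln: Dict[str, str], names: List[str], min_cov: float) -> int:
    """Count columns where at least min_cov of the chosen sequences have data."""
    if not names:
        return 0
    length = len(aln[names[0]])
    count = 0
    for i in range(length):
        present = sum(1 for n in names if aln[n][i] not in GAP_CHARS)
        if present / len(names) >= min_cov:
            count += 1
    return count


def clades(tree: Tree) -> List[List[str]]:
    """Return the leaf set of every subtree, i.e. one side of each bipartition."""
    out: List[List[str]] = []

    def walk(node: Tree) -> List[str]:
        if isinstance(node, str):
            leaves = [node]
        else:
            leaves = walk(node[0]) + walk(node[1])
        out.append(leaves)
        return leaves

    walk(tree)
    return out


def removal_tradeoff(
    aln: Dict[str, str], tree: Tree, min_cov: float
) -> List[Tuple[int, int, List[str]]]:
    """List (sequences kept, sites covered, clade removed) for each candidate clade."""
    all_names = sorted(aln)
    results = []
    for clade in clades(tree):
        kept = [n for n in all_names if n not in clade]
        if not kept:  # removing every sequence is not a meaningful candidate
            continue
        results.append((len(kept), covered_sites(aln, kept, min_cov), clade))
    return results


# Toy alignment: sequences C and D are mostly missing data.
toy_aln = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "AC------",
    "D": "ACG-----",
}
toy_tree = (("A", "B"), ("C", "D"))

for kept, sites, removed in removal_tradeoff(toy_aln, toy_tree, min_cov=1.0):
    print(f"remove {removed}: {kept} sequences kept, {sites} sites covered")
```

In this toy case, dropping the clade containing C and D leaves two sequences but raises the number of fully covered sites from 2 to 8, which is the kind of sequences vs. sites choice the abstract refers to.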

Details

Language(s): eng - English
 Dates: 2014-12-13; 2015-05-18; 2015-06-09; 2015
 Publication Status: Issued
 Pages: -
 Publishing info: -
 Table of Contents: -
 Rev. Type: Peer
 Identifiers: DOI: 10.1186/s12859-015-0619-8
 Degree: -


Source 1

Title: BMC Bioinformatics
Source Genre: Journal
 Creator(s):
Affiliations:
Publ. Info: BioMed Central
Pages: -
Volume / Issue: 16
Sequence Number: 190
Start / End Page: -
Identifier: ISSN: 1471-2105
CoNE: https://pure.mpg.de/cone/journals/resource/111000136905000