Optimization of sequence alignments according to the number of sequences vs. 
number of sites trade-off

Dutheil, Julien Y.; Figuet, Emeric

doi:10.1186/s12859-015-0619-8

Dutheil, J. Y., & Figuet, E. (2015). Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off. BMC Bioinformatics, 16: 190. doi:10.1186/s12859-015-0619-8.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-0027-AD52-0
Version Permalink: https://hdl.handle.net/21.11116/0000-0006-3C81-5

Genre: Journal Article

Files

Dutheil_Figuet_2015.pdf (Publisher version), 868KB

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-0027-AD54-C

Name:
Dutheil_Figuet_2015.pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
© 2015 Dutheil and Figuet. This is an Open Access article distributed under the terms of the Creative Commons Attribution License.

License:
https://creativecommons.org/licenses/by/4.0/

Locators

Locator:
http://www.biomedcentral.com/1471-2105/16/190 (Publisher version) Open Access status unknown

Description:
-

OA-Status:

Creators

Creators:
Dutheil, Julien Y.¹, Author
Figuet, Emeric, Author

Affiliations:
¹ Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society, ou_1445635

Content

Free keywords: Sequence alignment; Comparative analysis prediction; Phylogeny; Sampling; Trade-off

Abstract: Background: Comparative analysis of homologous sequences enables the understanding of evolutionary patterns
at the molecular level, unraveling the functional constraints that shaped the underlying genes. Bioinformatic pipelines
for comparative sequence analysis typically include procedures for (i) alignment quality assessment and (ii) control of
sequence redundancy. An additional, underassessed step is the control of the amount and distribution of missing
data in sequence alignments. While the number of sequences available for a given gene typically increases with time,
the site-specific coverage of each alignment position remains highly variable because of differences in sequencing
and annotation quality, or simply because of biological variation. For any given alignment-based analysis, the
selection of sequences thus defines a trade-off between the species representation and the quantity of sites with
sufficient coverage to be included in the subsequent analyses.
Results: We introduce an algorithm for the optimization of sequence alignments according to the number of
sequences vs. number of sites trade-off. The algorithm uses a guide tree to compute scores for each bipartition of the
alignment, allowing the recursive selection of sequence subsets with optimal combinations of sequence and site
numbers. By applying our methods to two large data sets of several thousands of gene families, we show that
significant site-specific coverage increases can be achieved while controlling for the species representation.
Conclusions: The algorithm introduced in this work allows the control of the distribution of missing data in any
sequence alignment by removing sequences to increase the number of sites with a defined minimum coverage. We
advocate that our missing data optimization procedure is an important step that should be considered in
comparative analysis pipelines, together with alignment quality assessment and control of sampled diversity. An
open-source C++ implementation is available at http://bioweb.me/physamp.
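
The trade-off described in the abstract can be illustrated with a small Python sketch. It is not the authors' guide-tree algorithm and not the physamp implementation; it only counts the sites that reach a minimum coverage and shows how removing a single sequence changes that count. The function names, the set of characters treated as missing data, and the toy alignment are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: these characters denote missing data in an aligned sequence.
GAP_CHARS = frozenset("-?")


def count_covered_sites(alignment: Dict[str, str], min_coverage: float) -> int:
    """Count columns whose fraction of non-missing characters is >= min_coverage."""
    sequences = list(alignment.values())
    if not sequences:
        return 0
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same aligned length")
    covered = 0
    for column in zip(*sequences):
        present = sum(1 for ch in column if ch not in GAP_CHARS)
        if present / len(sequences) >= min_coverage:
            covered += 1
    return covered


def single_removal_trade_off(
    alignment: Dict[str, str], min_coverage: float
) -> List[Tuple[str, int, int]]:
    """For each sequence, report (removed name, sequences kept, covered sites)."""
    rows = []
    for name in alignment:
        subset = {k: v for k, v in alignment.items() if k != name}
        rows.append((name, len(subset), count_covered_sites(subset, min_coverage)))
    return rows


# Hypothetical toy alignment: seqC and seqD each miss half of the sites.
toy = {
    "seqA": "ATGCATGC",
    "seqB": "ATGCATGA",
    "seqC": "ATGC----",
    "seqD": "----ATGC",
}

if __name__ == "__main__":
    print("all sequences:", count_covered_sites(toy, 1.0), "fully covered sites")
    for removed, kept, sites in single_removal_trade_off(toy, 1.0):
        print(f"without {removed}: {kept} sequences, {sites} fully covered sites")
```

In this toy case, dropping one of the two incomplete sequences gives up one sequence in exchange for fully covered sites, which is the kind of exchange the published method optimizes over subsets chosen along a guide tree.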

Details

Language(s): eng - English

Dates: Submitted: 2014-12-13; Accepted: 2015-05-18; Published Online: 2015-06-09; Date issued: 2015

Publication Status: Issued

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: Peer

Identifiers: DOI: 10.1186/s12859-015-0619-8

Degree: -

Source 1

Title: BMC Bioinformatics

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: BioMed Central

Pages: -
Volume / Issue: 16
Sequence Number: 190
Start / End Page: -
Identifier: ISSN: 1471-2105
CoNE: https://pure.mpg.de/cone/journals/resource/111000136905000