  Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off

Dutheil, J. Y., & Figuet, E. (2015). Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off. BMC Bioinformatics, 16: 190. doi:10.1186/s12859-015-0619-8.


Basic data

Genre: Journal article

Files

Dutheil_Figuet_2015.pdf (Publisher version), 868 KB
Name: Dutheil_Figuet_2015.pdf
Description: -
Visibility: Public
MIME type / Checksum: application/pdf / [MD5]
Copyright date: -
Copyright info: © 2015 Dutheil and Figuet. This is an Open Access article distributed under the terms of the Creative Commons Attribution License.

External references

External reference: http://www.biomedcentral.com/1471-2105/16/190 (Publisher version)
Description: -

Creators

Creators:
Dutheil, Julien Y. (1), Author
Figuet, Emeric, Author
Affiliations:
(1) Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society, ou_1445635

Content

Keywords: Sequence alignment; Comparative analysis prediction; Phylogeny; Sampling; Trade-off
Abstract: Background: Comparative analysis of homologous sequences enables the understanding of evolutionary patterns
at the molecular level, unraveling the functional constraints that shaped the underlying genes. Bioinformatic pipelines
for comparative sequence analysis typically include procedures for (i) alignment quality assessment and (ii) control of
sequence redundancy. An additional, underassessed step is the control of the amount and distribution of missing
data in sequence alignments. While the number of sequences available for a given gene typically increases with time,
the site-specific coverage of each alignment position remains highly variable because of differences in sequencing
and annotation quality, or simply because of biological variation. For any given alignment-based analysis, the
selection of sequences thus defines a trade-off between the species representation and the quantity of sites with
sufficient coverage to be included in the subsequent analyses.
Results: We introduce an algorithm for the optimization of sequence alignments according to the number of
sequences vs. number of sites trade-off. The algorithm uses a guide tree to compute scores for each bipartition of the
alignment, allowing the recursive selection of sequence subsets with optimal combinations of sequence and site
numbers. By applying our methods to two large data sets of several thousands of gene families, we show that
significant site-specific coverage increases can be achieved while controlling for the species representation.
Conclusions: The algorithm introduced in this work allows the control of the distribution of missing data in any
sequence alignment by removing sequences to increase the number of sites with a defined minimum coverage. We
advocate that our missing data optimization procedure is an important step which should be considered in
comparative analysis pipelines, together with alignment quality assessment and control of sampled diversity. An
open source C++ implementation is available at http://bioweb.me/physamp.
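
The trade-off described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' guide-tree algorithm and not the physamp implementation; it only counts the sites that reach a minimum coverage before and after one gap-rich sequence is removed. The function name `count_covered_sites`, the toy alignment, and the choice of `-` and `?` as missing-data characters are assumptions made for this illustration.

```python
from typing import Dict

# Assumption: these characters denote missing data in the alignment.
MISSING = {"-", "?"}


def count_covered_sites(alignment: Dict[str, str], min_coverage: float) -> int:
    """Count columns whose fraction of non-missing characters
    is at least min_coverage (a value between 0 and 1)."""
    sequences = list(alignment.values())
    if not sequences:
        return 0
    n_seq = len(sequences)
    n_sites = len(sequences[0])
    if any(len(s) != n_sites for s in sequences):
        raise ValueError("all sequences must have the same aligned length")
    covered = 0
    for column in zip(*sequences):
        present = sum(1 for c in column if c not in MISSING)
        if present >= min_coverage * n_seq:
            covered += 1
    return covered


# Hypothetical toy alignment: sp4 is mostly missing data.
toy = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTAC--",
    "sp3": "ACG-ACGT",
    "sp4": "----AC--",
}

# Sites fully covered with all four sequences, then without sp4.
full = count_covered_sites(toy, 1.0)
reduced = count_covered_sites({k: v for k, v in toy.items() if k != "sp4"}, 1.0)
print(full, reduced)  # prints "2 5"
```

In this toy case, dropping one sequence raises the number of fully covered sites from 2 to 5, at the cost of losing one species. The published algorithm chooses which sequences to remove by scoring bipartitions of a guide tree, which this sketch does not attempt.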

Details

Language(s): eng - English
Date: 2014-12-13; 2015-05-18; 2015-06-09; 2015
Publication status: Published
Pages: -
Place, Publisher, Edition: -
Table of contents: -
Review type: Peer review
Identifiers: DOI: 10.1186/s12859-015-0619-8
Degree: -

Source 1

Title: BMC Bioinformatics
Source genre: Journal
Place, Publisher, Edition: BioMed Central
Pages: -
Volume / Issue: 16
Article number: 190
Start / End page: -
Identifier: ISSN: 1471-2105
CoNE: https://pure.mpg.de/cone/journals/resource/111000136905000