Fast computation of genome distances

Klötzl, Fabian

Item

ITEM ACTIONSEXPORT

Add to Basket

Local TagsRelease HistoryDetailsSummary

Released

Thesis

Fast computation of genome distances

MPS-Authors

/persons/resource/persons191618

Klötzl, Fabian
Research Group Bioinformatics, Department Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Max Planck Society;

External Resource

No external resources are shared

Fulltext (restricted access)

There are currently no full texts shared for your IP range.

Fulltext (public)

There are no public fulltexts stored in PuRe

Supplementary Material (public)

There is no public supplementary material available

Citation

Klötzl, F. (2020). Fast computation of genome distances. PhD Thesis, Universität zu Lübeck. Institut für Neuro- und Bioinformatik., Lübeck.

Cite as: https://hdl.handle.net/21.11116/0000-0007-9EAB-7

Abstract

To understand the evolutionary relationships between organisms, they are typicallypresented in a tree-like structure, a phylogeny. In genomic studies, phylogenies aretraditionally reconstructed from a multiple sequence alignment. While most accurate,this approach is also computationally demanding. The problem is that in order to identifyshared homologies, the sequences are usually first aligned nucleotide by nucleotide.This alignment step has become a bottleneck in the practice of molecular biology, wherethousands of whole bacterial genomes, each a few megabases long, are sequenced andthen need to be summarized as phylogenies when analyzing pathogen outbreaks.One alternative are methods that estimate evolutionary distances directly from un-aligned genomes. These pairwise distances can then be used to cluster sequences in atree. Most of these alignment-free methods heavily rely on exact matching techniques forwords of a fixed size for fast sequence comparison. However, they usually do not reflectthe substitution rate, the most widely used measure of evolutionary distance.Instead of using words of fixed size, Haubold et al. (2015) used matches of maximallength as anchors for approximate pairwise alignments. These anchor alignments then canbe used to estimate the substitution rate. A first implementation,andi, quickly estimatesaccurate pairwise distances from hundreds of bacterial genomes on standard hardware.However, the thousands of genomes currently being collected during outbreaks againslow the program down.Andiuses a suffix array as a full-text index for each of the input sequences. Since con-structing and searching in a suffix array is slow, the aim of this thesis was to investigate,whether it might be possible to just compute a single suffix array for one of the inputsequences and pile all remaining sequences onto that reference. This should produce anapproximate multiple sequence alignment, from which pairwise mismatches could becounted.This approach is implemented in the programphylonium(Klötzl and Haubold 2019). Itis available via package managers or as open source atgithub.com/evolbioinf/phylonium.Phyloniumis much faster thanandiwhile losing little of its predecessor ’s accuracy. In thisthesis I explain the background tophylonium, describe its implementation, and applyit to simulated and real data. In the application section I comparephyloniumto its bestcompetitors and show that it holds a reasonable position in the classical trade-off betweenspeed and accuracy.