Released

Poster

Building usable full genome variation graphs

MPS-Authors
Kubica,  C
Department Molecular Biology, Max Planck Institute for Developmental Biology, Max Planck Society;

Bemm,  F
Department Molecular Biology, Max Planck Institute for Developmental Biology, Max Planck Society;

Weigel,  D
Department Molecular Biology, Max Planck Institute for Developmental Biology, Max Planck Society;

Citation

Kubica, C., Bemm, F., & Weigel, D. (2017). Building usable full genome variation graphs. Poster presented at German Conference on Bioinformatics (GCB 2017), Tübingen, Germany. doi:10.7287/peerj.preprints.3268v1.


Cite as: https://hdl.handle.net/21.11116/0000-000A-6EB6-E
Abstract
The 1001 Genomes Project generated a single nucleotide polymorphism (SNP) and short structural variant (short SV) map for well over 1000 wild strains (accessions) of Arabidopsis thaliana. In addition, transcriptome, methylation and phenotypic data were collected for most of the accessions. By using long-read sequencing technologies to generate de novo assemblies of diverse A. thaliana accessions, we are launching the next phase of this project, in which we will detect and genotype large SVs. First, we will shift from a single-reference approach to a multiple-genome graph representing a set of highly diverse A. thaliana accessions. Based on this graph, we will detect SVs and subsequently genotype them in the 1001 Genomes Project short-read data set.
Most genome graphs are constructed from a multiple whole genome alignment (WGA). Building a WGA, however, is not trivial, and its quality depends on having enough shared regions to form informative nodes and (super-)bubbles (PNGH) in the graph. The quality of the WGA depends on several factors, the major ones being the similarity and the repetitiveness of the aligned sequences. High diversity results in fewer and smaller alignment blocks, whereas repetitiveness leads to multiple alignments. Such a WGA converts into a highly connected, partially circularized graph that contains almost no usable information, because nodes are too short and edges too abundant to reliably and uniquely anchor superbubbles around interesting structural variants.
Here we propose ways to cope with diverse sequences during graph construction. Our main goal is to create a low-complexity graph. We alter previous graph construction approaches by focusing on local alignment anchors. This first approach reduces alignment fragmentation by considering only regions near useful alignment anchors (MUMs (DKF+99) / minimizers (RHH+04)) and thus prevents self-alignments, which would otherwise circularize the graph.
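To make the anchor idea concrete, the following is a minimal sketch of minimizer-based anchoring between two sequences. It is not the authors' implementation: the function names, the parameters k and w, the lexicographic k-mer ordering, and the rule of keeping only minimizers that are selected exactly once in each sequence are all assumptions made for illustration. Production tools typically use hashed orderings and canonical (strand-independent) k-mers instead.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return the set of (k-mer, position) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties). Hypothetical helper.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((window[offset], start + offset))
    return picked


def shared_anchors(seq_a, seq_b, k=5, w=4):
    """Pair minimizers that were selected exactly once in each sequence.

    Minimizers selected at several positions (e.g. inside repeats) are
    dropped, which is one assumed way to avoid ambiguous or
    self-matching anchors.
    """
    index_a = defaultdict(list)
    index_b = defaultdict(list)
    for kmer, pos in minimizers(seq_a, k, w):
        index_a[kmer].append(pos)
    for kmer, pos in minimizers(seq_b, k, w):
        index_b[kmer].append(pos)

    anchors = []
    for kmer, pos_a in index_a.items():
        pos_b = index_b.get(kmer, [])
        if len(pos_a) == 1 and len(pos_b) == 1:
            anchors.append((pos_a[0], pos_b[0], kmer))
    return sorted(anchors)
```

Alignment could then be restricted to windows around the returned anchor positions, so that a fully repetitive stretch contributes no anchors at all.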
In a second approach, we focus only on regions of interest, resolve them to the highest possible resolution, and skip the non-informative parts around them. We further show that variation can be removed from a finished graph by pruning, taking information such as allele frequencies within a population data set into account. Although our approaches lose information, they enable us to generate genome graphs that, when combined with previously generated short-read data, help to understand variation in SNPs, short and long SVs, and transposable elements (TEs) at an unprecedented resolution.
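Frequency-based pruning can be sketched as follows. This is only an illustration under stated assumptions: the bubble representation (a mapping from a bubble id to allele sequences and carrier counts), the function name prune_bubbles, and the 5% default threshold are hypothetical and not taken from the poster.

```python
def prune_bubbles(bubbles, n_accessions, min_af=0.05):
    """Drop rare alleles from each bubble.

    bubbles maps a bubble id to {allele_sequence: carrier_count}.
    Alleles with frequency below min_af are removed; the most common
    allele is always kept so that a path through the bubble remains.
    Bubbles left with a single allele are returned separately, since
    they no longer represent variation and can be collapsed.
    """
    kept, collapsed = {}, {}
    for bubble_id, alleles in bubbles.items():
        major = max(alleles, key=alleles.get)
        survivors = {
            allele: count
            for allele, count in alleles.items()
            if allele == major or count / n_accessions >= min_af
        }
        if len(survivors) > 1:
            kept[bubble_id] = survivors
        else:
            collapsed[bubble_id] = major
    return kept, collapsed


# Usage with made-up counts for 100 accessions.
example = {"b1": {"A": 90, "AT": 10}, "b2": {"G": 99, "C": 1}}
kept, collapsed = prune_bubbles(example, n_accessions=100)
```

In this toy input, bubble b1 keeps both alleles, whereas b2 collapses to its major allele because the alternative is carried by only one accession.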