Free keywords:
-
Abstract:
The 1001 Genomes Project generated a map of single nucleotide polymorphisms (SNPs) and short structural variants (short SVs) for well over 1000 wild strains (accessions) of Arabidopsis thaliana. In addition, transcriptome, methylation and phenotypic data were collected for most of the accessions. By using long-read sequencing technologies to generate de novo assemblies of diverse A. thaliana accessions, we are launching the next phase of this project, in which we will detect and genotype large SVs. First, we will shift from a single-reference approach to a multi-genome graph representing a set of highly diverse A. thaliana accessions. Based on this graph, we will detect SVs and subsequently genotype them in the 1001 Genomes Project short-read data set. Most genome graphs are constructed from a multiple whole genome alignment (WGA). Building a WGA, however, is not trivial, and its usefulness depends on having enough shared regions to form informative nodes and (super-)bubbles (PNGH) in the graph. The quality of the WGA depends on several factors, the major ones being the similarity and the repetitiveness of the aligned sequences. High diversity results in fewer and smaller alignment blocks, whereas repetitiveness leads to multiple alignments. Such a WGA converts into a highly connected, partially circularized graph that contains almost no usable information, because nodes are too short and edges too abundant to reliably and uniquely anchor superbubbles around interesting structural variants. Here we propose ways to cope with diverse sequences during graph construction. Our main goal is to create a low-complexity graph. We alter previous graph construction approaches by focusing on local alignment anchors. This approach reduces alignment fragmentation by considering only regions near useful alignment anchors (MUMs (DKF+99) / minimizers (RHH+04)) and thus prevents self-alignments, which would otherwise circularize the graph.
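To make the notion of an alignment anchor concrete, the following is a minimal sketch of minimizer selection, assuming plain lexicographic ordering of k-mers. The function names `minimizers` and `shared_unique_anchors`, and the parameters `k` and `w`, are illustrative assumptions and not part of the pipeline described here. Discarding k-mers that are selected more than once within a sequence is one simple way to keep repeats from anchoring a sequence to itself.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties). Illustrative only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def shared_unique_anchors(seq_a, seq_b, k=5, w=4):
    """Pair minimizer k-mers that are selected exactly once in each sequence.

    Returns sorted (position_in_a, position_in_b, k-mer) triples.
    K-mers selected at several positions are dropped as likely repeats.
    """
    def unique(seq):
        positions = {}
        for pos, kmer in minimizers(seq, k, w):
            positions.setdefault(kmer, []).append(pos)
        return {kmer: p[0] for kmer, p in positions.items() if len(p) == 1}

    a, b = unique(seq_a), unique(seq_b)
    return sorted((a[kmer], b[kmer], kmer) for kmer in a.keys() & b.keys())
```

In this sketch, only the triples returned by `shared_unique_anchors` would seed local alignment; everything far from an anchor is left unaligned.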
In a second approach, we focus only on regions of interest, resolve them at the highest possible resolution and skip the non-informative parts around them. We further show that variation can be removed from a finished graph by pruning that takes into account information such as allele frequencies within a population data set. Although our approaches result in a loss of information, they enable us to generate genome graphs that, when combined with previously generated short-read data, help to understand variation in SNPs, short and long SVs, and transposable elements (TEs) at an unprecedented resolution.
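Frequency-based pruning can be illustrated with a small sketch. It assumes a deliberately simplified representation in which each bubble maps its alternative paths to the number of accessions carrying them; the function name `prune_rare_alleles`, the `min_freq` threshold and the data layout are hypothetical and chosen only for illustration.

```python
def prune_rare_alleles(bubbles, min_freq=0.05):
    """Drop paths whose allele frequency is below min_freq.

    bubbles: dict mapping bubble id -> dict of path name -> number of
    accessions carrying that path. A bubble left with fewer than two
    paths no longer represents variation and is removed entirely.
    """
    pruned = {}
    for bubble_id, path_counts in bubbles.items():
        total = sum(path_counts.values())
        if total == 0:
            continue  # no genotyped accessions, nothing to keep
        kept = {path: count for path, count in path_counts.items()
                if count / total >= min_freq}
        if len(kept) > 1:
            pruned[bubble_id] = kept
    return pruned
```

With a threshold of 0.05, a bubble whose alternative path is carried by 1 of 100 accessions would disappear from the graph, while one carried by 10 of 100 would be retained.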