Abstract:
Reference genomes are foundational to modern genetics, yet by their nature, they cannot capture the full extent of genetic diversity within a species. Representing the genetic potential of an entire species as a single linear sequence introduces inherent biases in all subsequent analyses. This reference bias has long been acknowledged, but only recent advances in sequencing technology have made it possible to address it effectively. The advent of long-read sequencing and the generation of multiple genome assemblies for the same species have enabled a more comprehensive exploration of intraspecies genome variation. These new technologies have also made possible the implementation of a longstanding concept: the genome graph. It integrates multiple reference genomes into a single data structure, offering a better representation of sequence diversity than linear references can provide.
In my work, I apply the genome graph approach to Arabidopsis thaliana by constructing a complex genome graph from six highly contiguous, de novo assembled genomes, each annotated with the novel pan-genome-aware annotation pipeline auto-ant. I demonstrate that building such a graph is not only theoretically possible but also practically feasible. The resulting graph captures the complete pan-genome of the input assemblies, including sequences absent from the current linear reference genome. I access this graph-based pan-genome using the reference-agnostic variant detection algorithm panSV. Furthermore, I show that short-read alignment to this genome graph is feasible and exhibits reduced reference bias owing to the expanded reference structure. Additionally, even a graph constructed from only seven genomes proves capable of representing the broader pan-genome of a larger mapping population.
Although the method needs further development, I have made a first case for the use of highly complex genome graphs in plant species.