Abstract
We have sequenced the genomes of 18 inbred accessions of Arabidopsis thaliana at ~40x coverage using paired-end Illumina sequencing with different insert sizes. We developed an assembly pipeline that uses iterative read mapping and de novo assembly to recover genome sequences accurately, with an error rate close to 1 in 10 kb in single-copy regions of the genome and 1 in 1 kb in repetitive or transposon-rich loci, as assessed with independent data. Naive projection of the coordinates of the 27,416 protein-coding genes in the reference annotation onto the 18 genomes predicted large-effect disruptions in 8,652 (32%), suggesting that Arabidopsis thaliana is able to survive disruptions in up to a third of its genes. To shed light on this high number, we developed a novel pipeline for de novo annotation that combines computational gene prediction with RNA-seq data at ~20x coverage from seedlings cultivated in highly controlled environments. Using this pipeline, we re-annotated each genome, finding that whilst there is considerable variation in gene structure, compensating changes help to ensure that altered transcripts can retain function. Thus, 8,757 genes had at least one additional or modified transcript in at least one accession. In particular, for 8,322/8,757 (96.2%) genes harbouring large-effect disruptions in at least one accession, the naively mapped transcripts were replaced by alternative transcripts. The effect of DNA sequence variation and altered gene models can be better understood by investigating the resulting protein sequences. We therefore analysed how transcript diversity affected the 40,578 inferred protein sequences, finding 3,840 (9.5%) proteins that had less than 50% amino-acid sequence identity with the corresponding TAIR10 proteins. Protein diversity varied across gene models, and isoforms with severe disruptions generally occurred at low frequency among the accessions.
To complement the genotype-focused analysis, we investigated quantitative transcriptome variation using the RNA-seq reads. We found 20,963 (78%) of all protein-coding genes to be expressed in at least one strain, with 9,360 (45%) exhibiting significant expression variation between strains. Mapping causal variants affecting gene expression, we identified variants associated with expression polymorphisms near 941 (10%) of the differentially expressed genes. These candidate cis-eQTLs are tightly mapped, and analysis of the location of eQTLs relative to local gene models revealed an excess of associations in regulatory regions, including the core promoter region and the 3′ UTR. This is one of the first studies in which multiple genomes from a single species have been assembled, re-annotated and integrated with their transcriptomes to understand and quantify the regulatory role of natural DNA variation in gene structure as well as expression. It may serve as a blueprint for forthcoming studies.