Abstract
As sequencing technologies expand, more individual genomes and transcriptomes become available and can be used to refine existing species annotations. Annotations are currently generated by automated pipelines and maintained by manual curation. These pipelines are limited in their ability to generalise across all domains of life. Because most genomes contain a large number of genes, even highly accurate pipelines produce some erroneous predictions.
PINAPL is a software suite designed to identify unannotated genes in annotated genomes. Evidence-based gene predictions are compared to existing annotations to highlight genes not present in a species' annotation. Supporting evidence is then collected to help assess whether each such gene plausibly exists in that species. These data are visualized interactively to facilitate manual curation of the results.
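The comparison of predictions against an existing annotation can be sketched as a simple interval check. This is a minimal illustration, not PINAPL's actual implementation: the abstract does not state the comparison criterion, so the rule used here (a prediction counts as unannotated when it overlaps no annotated gene on the same sequence and strand) is an assumption, as are the names `Interval`, `overlaps`, and `unannotated` and the coordinates in the usage lines.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A gene locus; coordinates are 1-based and inclusive (assumed convention)."""
    chrom: str
    start: int
    end: int
    strand: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two loci share at least one base on the same sequence and strand."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def unannotated(predictions: List[Interval], annotations: List[Interval]) -> List[Interval]:
    """Return the predictions that overlap no annotated gene (hypothetical criterion)."""
    return [
        p for p in predictions
        if not any(overlaps(p, a) for a in annotations)
    ]


# Illustrative, made-up coordinates.
annotated = [Interval("chr1", 100, 500, "+")]
predicted = [
    Interval("chr1", 450, 900, "+"),    # overlaps the annotated gene
    Interval("chr1", 2000, 2600, "+"),  # no annotated gene at this locus
]
candidates = unannotated(predicted, annotated)
```

The pairwise scan is quadratic in the number of genes; for whole genomes an interval tree or a sorted sweep would be the more usual choice.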
PINAPL identifies a number of well-supported, unannotated genes, including RAG2 in cod, KNL1 in cow, and DOLK in stickleback. Each has highly significant BLAST hits to known genes with hundreds of orthologs, yet is missing from the respective genome annotation. Each of these genes plays an important role in cell biology and is linked to severe knockdown phenotypes. It is likely that these genes were missed by automated annotation methods.
PINAPL offers a comprehensive tool for detecting, visualizing, and curating unannotated genes in annotated genomes. The genes identified through PINAPL can be used to improve existing annotations so that they better represent an organism's biology, benefiting experimental and analytical work that relies on those annotations.