Poster

Accurate RNA-seq based de novo annotation using mGene.ngs

MPS-Authors

Schweikert, G
Department Molecular Biology, Max Planck Institute for Developmental Biology, Max Planck Society;


Zeller, G
Department Molecular Biology, Max Planck Institute for Developmental Biology, Max Planck Society;

Citation

Behr, J., Bohnert, R., Kahles, A., Schweikert, G., Zeller, G., Hartmann, L., et al. (2011). Accurate RNA-seq based de novo annotation using mGene.ngs. Poster presented at 19th Annual International Conference on Intelligent Systems for Molecular Biology and 10th European Conference on Computational Biology (ISMB/ECCB 2011), Wien, Austria.


Cite as: https://hdl.handle.net/21.11116/0000-0010-52AA-2
Abstract
The model organism Caenorhabditis elegans is one of the most important systems for studying cell fate and the regulation of apoptosis. To gain a deeper understanding of the regulatory mechanisms in C. elegans, its close evolutionary context was explored and the genomes of five closely related nematodes were sequenced. Currently, the major limitation in analyzing these genomes is the lack of accuracy in their transcriptome annotation. In this project we sequenced the transcriptomes (RNA-Seq) of all five nematodes and of C. elegans using the Illumina sequencing platform (~300M reads, strand-specific, paired-end, 76 bp). Based on the RNA-Seq data we annotated all six nematodes using the newly developed de novo gene finding system mGene.ngs (Schweikert et al. 2009). mGene.ngs combines features from the RNA-Seq data and the genomic DNA sequence already at the learning stage. The system can be trained on a set of highly expressed protein-coding and non-coding genes whose structure can be directly inferred from the RNA-Seq data. Training was done independently for each of the six organisms. This is a conceptual difference from standard annotation strategies, which rely on sequence alignments, on classifiers trained on a single representative organism, or on both. Such standard approaches therefore tend to underestimate proteome variability and are biased towards a single organism and/or the set of known proteins.

While our approach tends to overestimate the variability, it allows us to compare the transcriptomes and proteomes of a set of organisms on an equal footing and can sensitively detect minor changes in gene structure between organisms. Predictions include alternative isoforms supported by spliced reads as well as non-coding genes and transcripts. To evaluate the approach we take advantage of the highly accurate C. elegans genome annotation. We observe that the prediction accuracy in terms of coding-transcript-level sensitivity (56.1%) and specificity (62.7%) compares very favorably with the well-known de novo transcriptome recognition system Cufflinks (Trapnell et al. 2010; sensitivity 49.9%, specificity 49.5%).
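
The abstract does not define how transcript-level sensitivity and specificity were computed. The following is a minimal sketch under two assumptions: that "specificity" follows the usual gene-finding convention (the fraction of predicted transcripts that match an annotated one), and that a match means identical chromosome, strand and coding exon boundaries. The function name, the transcript representation and the toy coordinates are hypothetical and only illustrate the arithmetic.

```python
from typing import Iterable, Tuple

# Hypothetical representation of a coding transcript:
# (chromosome, strand, tuple of (start, end) coding exons).
Exon = Tuple[int, int]
Transcript = Tuple[str, str, Tuple[Exon, ...]]


def transcript_level_accuracy(
    annotated: Iterable[Transcript],
    predicted: Iterable[Transcript],
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the transcript level.

    Assumption: a prediction is correct only if it matches an annotated
    transcript exactly (same chromosome, strand and exon boundaries).
    sensitivity = correct / number of annotated transcripts
    specificity = correct / number of predicted transcripts
    """
    ann = set(annotated)
    pred = set(predicted)
    correct = len(ann & pred)
    sensitivity = correct / len(ann) if ann else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data (made-up coordinates, not taken from the study).
ANNOTATED = [
    ("I", "+", ((100, 200), (300, 400))),
    ("I", "+", ((100, 200), (350, 400))),  # alternative isoform
    ("II", "-", ((500, 650),)),
    ("III", "+", ((10, 90), (120, 180))),
]
PREDICTED = [
    ("I", "+", ((100, 200), (300, 400))),  # exact match
    ("II", "-", ((500, 650),)),            # exact match
    ("III", "+", ((10, 95), (120, 180))),  # one exon boundary is off
]

sens, spec = transcript_level_accuracy(ANNOTATED, PREDICTED)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under the exact-match assumption a single shifted exon boundary makes the whole transcript count as wrong, which is one reason transcript-level figures are much lower than exon-level or nucleotide-level ones.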