To cite PlanMine, please refer to the following publication:
Rozanski, A., Moon, H., Brandl, H., Martín-Durán, J. M., Grohme, M., Hüttner, K., Bartscherer, K., Henry, I., & Rink, J. C.
PlanMine 3.0—improvements to a mineable resource of flatworm biology and biodiversity
Nucleic Acids Research, gky1070. doi:10.1093/nar/gky1070 (2018)
PlanMine is a database for mining flatworm genomes and transcriptomes. We host transcriptomes and the genome of the planarian model species S. mediterranea, as well as transcriptomes of “wild” planarian species and a broad range of parasitic and non-parasitic flatworms for comparative analysis. You can search PlanMine by sequence (using our BLAST query) or by annotation (this is easiest using our predefined templates). Mineable information currently includes BLAST homologies, GO-terms, orthologues in other planarian species, gene expression information and taxonomic information on the represented species. PlanMine is built with the IntermMne data warehouse platform, which is also used by other model organism communities including WormMine, FlyMine, YeastMine, and ZebrafishMine and thus allows easy cross-system comparisons.
Check CheckTranscripts are named using the following pattern:
Each transcript id takes the form of a two letter city code, followed by a four letter
species code, followed by a version number, followed by a number which represents a unique
transcript ID in that assembly. Therefore the Dresden S. mediterranea transcripts that have had
their annotation updated 6 times are prefixed with dd_Smed_v6_*
and followed by the
transcript ID number.
City identifiers currently in use include: dd (Dresden- Rink lab); mu (Muenster- Bartscherer lab); be (Berlin- Rajewski lab); to (Toronto- Pearson lab); ox (Oxford- Aboobaker lab); uc (Urbana Champaign- Newmark lab); bo (Boston- Reddien lab); ka (Kansas- Sanchez lab); gr (Groningen- Berezikov lab); bg (Bergen- Hejnol lab). In addition, there are ex (Ensembl) and PublishedTranscript (transcripts in publications) which are not city identifiers.
Taking the example of dd_Smed_v6_10001_0_1
the ID prefix is dd_Smed_v6
(as described
above).
The 10001_0_1 is derived from the ID assigned based on the transcript ID from the trinity
assembler (comp10001_c0_seq1). We simply remove comp, c and seq characters and keep
the underscores.
For user-contributed assemblies we add the ID Scheme prefix but the transcript suffix is taken from the origin contributor’s nomenclature for easy comparison with source data.
Trinity assembled transcripts are classified into genes using the underlying graph component. This can be obtained by trimming the last underscore number (i.e. seq1, seq2, seq3) from the corresponding trinity transcript ID. Therefore, the final number in a transcript ID is omitted in this pseudo-gene ID. This final number in the full transcript ID is therefore used to refer to isoforms of a gene.
Predicted gene models (currently only available for S. mediterranea) are named using a four-letter taxonomy code,
followed by a 8-digit gene id. Each gene Id is furthermore suffixed with
a version number. Example: SMESG00000001.1
. Here, "SMES" designates the sexual strain of S. mediterranea as the origin of the genome on basis of which the gene predictions were made. "G" designates gene predictions and the following number the unique identity of this gene, so in this case SMES gene number 1. The suffix designates the version of the prediction, so in this case version number 1. The inevitable future updates to the gene models will only change the version suffix, but not the actual gene ID. This means that a given locus will keep its gene ID across releases.
The same naming scheme is also used for the cDNA transcripts that belong to a given gene prediction. Example: SMEST00000001.1
. Here, the "T" designates transcripts and the following 8-digit numbers is the same as for the gene of origin. The suffix of transcript IDs designates transcript identity, as genes often give rise to multiple transcripts. This also means that the specific identity of transcripts cannot be preserved across future gene model versions due to the lack of version encoding in transcript names. Our gene/naming transcript scheme is generally inspired by ENSEMBL, which also does not keep track of transcripts across releases.
Note that the current gene and transcript model versions can be downloaded under the "data sources" tab of PLanMine (see below).
Planmine designates a candidate / putative gene symbol and description.
These are selected from among the pooled top3 blast hits of all associated transcripts against
refseq_protein
with a non-draft symbol
using the hit with lowest e-value. Symbols and predictions are preliminary,
and users can submit corrections and change requests via the feedback form. Since not all gene models blast against
reference genes, some gene models in PlanMine won't have a symbol intially.
Because secondary high confidence blast results may have other symbols, we also report them as synonyms in the header of each gene model page
It is important to note that PlanMine does not use gene names for trinity assemblies, e.g. “b-Catenin-1”. Therefore, even transcripts originating from published genes will be designated by the above ID scheme. The annotated “published genes” list offers a workaround for querying/analyzing transcripts on basis of published gene names. Gene name searches bring up entries in the “published genes” list. From there, you can retrieve corresponding transcripts in the various assemblies either via the list of orthologous or by using the provided fasta sequence of the published gene as BLAST query. Transcripts are further associated with annotation information through protein functional domain predictions (InterProScan) or RefSeq transcripts with strong Blast homology. These transcript annotations can also be searched via the Search box in the top right hand corner of every page.
This section gives a brief overview of the PlanMine functions accessible via the tabs in the page header. Just follow the links for detailed explanations of individual functions or try out the tutorials at the end of the help manual.
This is our home page allowing you to access many common tasks (see the Home section for more details)
Planmine allows to blast arbitrary protein and nucleotide query sequences against all species in PlanMine but also against various reference species.
The integrated BLAST search page allows you to search against all transcripts stored in PlanMine with a query sequence of your choice. The type of query, e.g., protein or nucleic acid, is automatically recognized. You can search against all assemblies or just select those you wish. Advanced BLAST parameters are also supported for search customization. Blast functionality is provided through the integration of the SequenceServer tool. See the BLAST section for more details.
PlanMine integrates an instance of the UCSC genome browser. This allows to view various tracks include coverage, repeat-masks, and Smed transcriptome.
This is a list of predefined queries that are useful to most users. It is possible to customize and modify these predefined templates and also create and save your own if you are logged in to PlanMine (myMine). See the Templates section for more details.
This tool allows the creation of more complex queries and users to edit existing template queries. For the majority of cases a template should already exist to perform the query you require. QueryBuilder is very powerful but takes some time to master. See the Query Builder section for more details and check out our tutorial.
Lists are a very powerful feature of PlanMine, allowing you for example to export multiple transcript sequences or to perform GO-term enrichment analysis on differentially expressed genes. Available lists include all the transcriptomes and stem cell- progenitor- or differentiated cell enriched transcripts (based on Labbe et al.). You can also create your own lists, either by importing lists of transcript identifiers, or by saving lists from search result tables. See the Lists section for more details.
The data sources page is divided into four categories: "Genomes", "Gene Predictions", "Transcriptomes", and "Contributors".
"Genomes" page and "Gene Predictions" page contain genomes fasta file and our all versions of gene predictions in GFF3 format to download respectively. "Transcriptomes" page has overview information about the transcriptomes stored in PlanMine. This includes basic assembly statistics (where available), an assembly report with detailed assembly information for each transcriptome, the ability to to bulk download information and the contact information of those that have contributed data to PlanMine. "Contributors" page shows all the institutes and labs who have contributed to PlanMine databases.
The "Genomes" page contains source sequence data (FASTA format) for genomes included in PlanMine. At present the S. mediterranea genome (dd_Smes_g4) is available for download.
The "Gene Predictions" page contains several files to download that give gene predictions (GFF3 format) based on specific filter criteria.
The "Transcriptomes" page displays overview information regarding all the transcriptomes available in PlanMine. This includes basic assembly statistics (where available), assembly reports with detailed assembly information for each transcriptome, and the ability to bulk download information. Contact information for those that have contributed data to PlanMine is also provided here. See the Assembly report section for more details.
Additionally, the "Contributors" page gives details regarding all the institutes and labs who have contributed data to PlanMine. If you have contributed data and are not included on this page please contact us.
If you are using PlanMine as part of an automated workflow it is possible to access the data stored in PlanMine programmatically using the APIs provided (Perl, Python, Ruby, Java).
The creation of a user account through the MyMine tab allows you to save your own queries, lists, and templates. Just try it out, no strings attached, and you will be able to get the most out of PlanMine.
Gene and Transcript information pages are perhaps the most useful pages in PlanMine. They include all data that is known about a given both gene and transcript, including associated features(if association is available), expression level(only transcript for now), protein domain occurrences(only transcript), Blast homology(only transcript) outside and within planarians and, for the dd_Smed and dd_Dlac transcripts, differential expression of the gene under diverse experimental conditions. See the sections below to find out more about the information available for genes and transcripts.
The gene view gives a graphical overview of features associated with a gene model. First, it highlights the genomic locus. Second, it links to associated transcripts.
Especially, JBrowse in gene page shows all the mapped transcripts along the genomic locus where users can easily check how specific transcripts are located as all the associated sequences are shown together.
The transcript view gives a graphical overview of features associated with a transcript using our integrated JBrowse viewer. Blast homology to proteins in RefSeq is shown in green, protein functional domains are shown in blue, predicted open reading frames (ORFs) and their direction in orange, gene family predictions in yellow (TreeFam), and the read coverage of the transcript used for our assembly with the Trinity assembler is shown as a light blue coverage track (only shown for internally assembled transcripts not for contributed assemblies).
Jointly, this information allows you to:
Learn which gene you are dealing with: The blast homologies already give you a very good idea if your sequence of interest has homologues in RefSeq.
Spot chimeric transcripts: Disparate BLAST annotations, more than one long open reading frame (especially when facing in opposite directions) and/or vastly different coverage densities along the transcript are all possible chimera indicators.
Obtain the translated sequence of the ORF: Clicking a graphical ORF representation in the JBrowse viewer brings up a pop-up window with the protein translation of the ORF, providing a useful shortcut from nucleotide to protein sequence.
Get information about the expression level of your gene: The scale of the raw read density track provides a useful expression metric (within the dataset used for transcriptome assembly).
An extremely useful feature of PlanMine is the ability to query the differential expression of a transcript under diverse experimental conditions. For this purpose, we re-map raw reads from published RNA-Seq datasets against the dd_Smed and dd_Dlac assembly.
Two types of gene expression data are currently available,
RNAi experiments and
expression in different cell types.
For RNAi experiments, the results are graphically displayed in the form of a bar chart showing up-or down-regulation of the gene relative to control and, by colour coding, whether or not the change is significant (defined using edgeR with raw counts, and filtered with an False Discovery Rate (FDR) cutoff of 1% as is described on the Trinity website. Please note that our criteria for defining significance may be different from the ones used in the original study, which may lead to differences in the results. The section to the right designates 1) the gene that was knocked down, 2) a cartoon illustrating the analyzed tissue area and the location and orientation of the amputation cut (in case of a regeneration experiment) and 3) further information about the analyzed sample (e.g. time post amputation). Note that you can access both the original raw data of the study and the publication using the quick links provided.
For cell type experiments, the bar graphs symbolize expression level (FPKM) in the respective cell populations WITHOUT normalization to a control. This display mode is useful for visualizing gene expression differences between stem cells, progeny or differentiated cells. The sample information to the right is organized as described above.
For man power constraints, gene expression data is currently only available for dd_Smed and dd_Dlac transcripts. If using one of the other Smed transcriptome as starting point, simply identify the respective dd_Smed orthologue and access expression information via the respective dd_Smed transcript page. We are always happy to include new or unpublished gene expression data sets in PlanMine- please notify us of any data sets you would like us to include.
Please note that the PlanMine search function or the query builder can be used to extract all genes that are significantly differentially expressed in a given RNAi experiment. Please see the respective tutorial for how-to information. This does not work for the cell type experiments (because we display raw fpkm rather than fold change). We therefore provide pre-saved lists of transcripts that are differentially expressed between the three cell populations (low stringency: significant expression difference AND fold-change >3; high stringency: significant expression difference AND fold-change >8). You can access and analyze these lists via the Lists tab.
Since v3, PlanMine also integrates sc-Seq results as published in Cell type transcriptome atlas for the planarian *Schmidtea mediterranea by Reddien et al., 2018 and of Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics by Plaas et al., 2018
Reciprocal Blast best hits are used to define orthologues in other species or other assemblies of the same species. These data are useful for finding corresponding transcripts in other assemblies. Please consult the respective Reference Section for the orthology search parameters.
Since v3.0, we also provide a more sensitive set of orthologous transcripts for S. mediterranea. For details see below.
All schmidtea mediterranea transcriptomes were mapped to the Schmidtea mediterranea genome (g4) using GMAP with parameters: --chimera-margin=200 --min-trimmed-coverage=0.6 --min-identity=0.6. After that, the overlap of genomic coordinates between transcripts have been used to establish homologous transcripts and these are listed at Associated Features section in each transcript page.
Gene Ontology information for a transcript is taken from the GO information stored in the NCBI database for RefSeq proteins that show strong Blast homology with our transcripts. See the Reference section for blast homology settings.
All transcriptomes stored in PlanMIne have been annotated with RefSeq BLAST homologies. To keep the information manageable, we only display the top three RefSeq hits with preference to well annotated model organisms in this table. This information provides a useful indication as to what the function of the transcript might be, but please note that automatically generated annotations are not perfect and that manual double checking is a good idea for important results. See the reference section for specific parameter settings.
Domain hits annotated by InterProScan are similarly shown to give a potential indication of protein function. For more information about domain annotation see the reference section.
Predicted Open Reading Frame information is displayed along with the ability to retrieve either the nucleotides sequence of an ORF or a translated ORF protein sequence. See the Reference section for specific parameter settings.
The Home Page contains three main items: the Blast search box, the template search box, and thumbnail image links to the species pages of the planarian species currently represented in PlanMine.
In the top left is the Blast search box, which allows you to paste a transcript in fasta format into the text area field and select a transcriptome to search against. For users requiring more advanced searching capability we recommend visiting the dedicated Blast page by clicking on the Blast tab at the top of the page. Note that multi-query searches are currently not possible.
The template tabs to the right of the Blast search box allows easy access to a number of pre-configured searches to query transcriptomes by functional annotation data. The results of such searches return tables of data that can be sorted and filtered with powerful data-type aware functionality.
Tabs allow users to
With a S. mediterranea transcript ID from one of the Smed assemblies it is possible to retrieve orthologous transcripts in Smed assemblies contributed by other research groups and display just the orthologous IDs or the longest Open Reading Frame (ORF) corresponding to these orthologous IDs. The “Keyword -> Smed published transcripts search is a useful shortcut for finding/analyzing published genes in PlanMine.
These predefined searches are useful for searching all transcripts in PlanMine using a common gene symbol or keyword (these are annotated to a transcript through their Blast orthologues in other annotated organisms in RefSeq), InterPro domain name (annotated using InterProScan), other protein domain name (also annotated through InterProScan but includes PFAM and Smart annotations), Treefam ID (annotated based on the Treefam database of gene families), or simply by a known transcript ID.
Here it is possible to retrieve sequences matching (partial matching supported) a particular transcript ID or just return the longest Open Reading Frame (ORF) corresponding to an ID. It is also possible to search for keywords (e.g. gene name or description) associated with known published transcripts resulting mainly from classical cloning techniques.
These searches allow users to search PlanMine by a known transcript ID and return a table of data that can be filtered and sorted. It is possible to return four kinds of table covering transcript annotated domains, blast annotation, gene ontology terms, and Treefam domains. In addition, follow the tutorial below to see how to perform a GO term enrichment analysis on all the transcripts that are differentially expressed in the Bartscherer lab FoxD(RNAi) experiment.
For advanced searches the QueryBuilder tool is very powerful and for help with this see the section below.
This table shows pictures of the planarian species for which transcriptomes are currently available in PlanMine.
Clicking on the picture or name of a flatworm takes you to a species page with expert-curated information about the species (anatomy, habitat, regenerative ability, geographical distribution, and reproductive strategy) along with close-up images useful for identifying the species, a list of relevant taxonomic publications, a geographical distribution map (courtesy to the Turbellarian Taxonomy Database), and summary information about the transcriptomes in PlanMine relevant to this species. Note that the blue icon on the location map indicates the sampling location of the strain used for the PlanMine transcriptome. Further, we provide a link to the Turbellarian Taxonomy Database as current expert repository of taxonomic information on Planarians.
Clicking on “Blast against ID” takes users to a Blast search page with filters predefined to only search against that particular species.
In the top right hand corner of every page you will find the PlanMine Search Box. The Box allows you to search all of PlanMine for:
Type in identifiers, names or keywords for transcripts, proteins domains (InterPro, PFAM, SMART, PROSITE), gene families (TreeFam), and gene ontology terms.
Use OR to search for either of two terms or quotation marks to search for phrases (e.g. “dna binding”).
Search also supports AND, NOT searches in a similar way and wildcard searches using are supported for partial matching (e.g. *catenin matches any term ending in catenin).
Preconfigured queries that search specific fields can be found in the templates box.
All occurrences of the search term are already grouped into different categories. Clicking a category filters the search results based on category (e.g. Transcript, Interpro Domian, Other Protein Domain, GO Term). It is also possible to filter by organism and assembly (for species with more than one assembly). When results are filtered by categories you can also save the results as a list for further use.
Filtered search results can then be saved as lists for later use (for more information about lists see the lists section). Note that you need to create a “MyMine” account in order to store/retrieve lists created in a previous session.
The input field takes a sequence in fasta format as is standard for Blast searching. The type of sequence, i.e., protein or nucleotide, is automatically recognized.
We primarily provide nucleotide databases containing all transcripts for an
assembly in PlanMine. The interface is based on phylogenetic trees focusing on “S. mediterranea”,
“Planarians”, “Flatworms” and “All”. In each preset, the phylogeny-based databases with
checkboxes are listed. You can search against the one or more specific assemblies of
interest by checking the boxes of the species as well as check the select all box to
search against all assemblies. If you enter a protein sequence, a tblastn
search is
automatically carried out.
Any additional advanced parameters can be added here, such as an
e-value cutoff (clicking on the question mark gives a full list of
advanced options). Then clicking the Blast button will perform the
correct kind of blast based in input sequence type and selected Blast
databases. Note that decreasing the e-value cutoff (e.g., -evalue 1.0
)
is sometimes useful for identifying weak homologies.
We have customised the Blast output provided by SequenceServer to add a query transcript specific graphical view. For this reason we recommend submitting single query sequences rather than multi-fasta queries.
The graphical output shows a query centric view with a line representing the query sequence at the top and multiple coloured lines representing high scoring pairs (HSPs) below. HSPs are coloured by e-value with very confident hits in red and weak hits in blue. The query centric viewer also provides information on the parts of a transcript that are not part of the HSP. This is particularly useful for spotting fragmented transcripts or so-called chimeras, erroneous fusions between two independent transcripts. Chimeras are a common error category in de novo assemblies. For instance, in queries against multiple Smed assemblies, a transcript displaying a long non-homologous extension in only one assembly is likely to be a chimera.
A standard Blast output summary table is also shown allowing fasta sequences to be downloaded for later use and links are present to link to individual HSP alignments.
Clicking a particular transcript ID opens the transcript page for this particular transcript.
Templates provide predefined searches that are commonly useful and generally required by the community. Templates provide a simple form to paste your input data (keyword, id, list) and query the database using various filters to give a results table.
Pre-configured templates cover questions such as:
Comparing Schmidtea assemblies: I have a dd_Smed transcript id but what is the orthologous id in the ox_Smed assembly
Comparing Species: I’d like to find all transcripts with a particular InterPro protein domain
Comparing Functions: I’d like to find all Gene Ontology terms associated with a Transcript ID
Transcripts: I’d like the longest Open Reading Frame detected within a particular transcript
The templates tab shows a list of all available templates including user defined queries saved into MyMine if the user is logged in. Templates can be saved with a unique named and a description along with keyword tags to enable easy searching.
For advanced searches the QueryBuilder tool is very powerful and for help with this see the section below.
QueryBuilder allows in depth access to the underlying data model to provide powerful search capabilities through an as intuitive as possible user-interface. However, such an interface can still be overwhelming for new users In our tutorials section we give an example as to the use of QueryBuilder.
Lists provide a powerful means for querying and manipulating sets of data at once. You can use lists to:
store and retrieve search results
analyze the occurrence or enrichment of protein domains, GO-terms or other types of annotation within a transcript set
Combine lists to create a union
Get the intersection or differences between two lists
use as input to our custom templates for searching the database
export the nucleotide sequences
Creating lists from external data: You can simply cut/paste transcript IDs from external files into the window (e.g., Excel files of RNA-Seq results) or you can import saved text files with transcript IDs using the import button.
Creating lists within PlanMine: You can also create lists via the various search functions of PlanMine. Here we save the output of a keyword search as a list:
Our widgets also allow GO term or protein domain enrichment analysis to be done on saved lists.
By default, transcript lists are displayed as follows:
Each column can be modified using the four icons in the header line (from left to right): “sort by this column”, “remove this column”, “toggle column visibility, and finally the “column summary” button, which provides a very useful overview of your data and options to filter it appropriately.
The nice thing about the column summary is that the summary you get depends on the structure and type of the data in the column selected as can be seen below:
In each case, you can filter according to the type of data represented in the column, e.g., by a particular species in the first example or by a specified range of transcript length in the second example.
The manage columns buttons above the table allow you to add further columns to the table (e.g., GO terms or BLAST annotations) and further filtering options. From left to right, the icons are: “Manage Columns” , “View active filters”, and the “Undo” button.
The manage columns button allows users to reorder columns in the table or even add or remove columns. The addition of new columns is based around the QueryBuilder functionality and the underlying data model. This is a powerful feature that allows you to make full use of the different layers of annotation in PlanMine. Even if you initially just import a list of transcript IDs, you can, by adding columns, add all PlanMine features to your analysis, including for example BLAST descriptions, GO terms or differential expression status in a given RNA-Seq data set. Together with the various filter modalities, this allows you to identify the transcripts that you are dealing with; carry out GO term enrichment analysis (see below) or you can learn which of your transcripts are differentially expressed in one of the RNAi RNA-Seq experiments stored in PlanMine. The active filter button shows active filters and allows new ones to be added through a Query Builder like interface. Finally the Undo button allows you to revert previous actions and return to an unfiltered, default sorted table. The large number of column options and filter functionalities may appear a bit daunting at first, but an exploration of the various feature is well worth the effort and will allow you to get the most out of PlanMine.
Tabular data can also be export in various ways using the buttons shown below:
From left to right, we have the “list button” , the “get code button” , and the “download button”. The download button is particularly useful for routine applications, allowing, amongst other options, the export of fasta files or Excel tables.
Widgets also allow GO term or protein domain enrichment analysis to be done on saved lists. Opening a saved list automatically brings up the widget window at the bottom of the list:
You can select the type of enrichment algorithm, the cutoff p-value or the ontology group to analyze. Below, you have the option to modify the background population list. The default is the full transcriptome assembly from which your list derives. Some searches may require modification of the default, for example lists combining multiple transcriptomes.
The table displays the individual GO terms with p-values above the cut-off and the number of transcripts annotated with the term in your list. By selecting a particular GO-term and clicking “view” or “download” you can retrieve the ID’s of the transcripts that are annotated with a particular GO term.
The Protein Domain Enrichment to the right provides the same analysis and retrieval options on basis of domain annotations.
Raw reads were trimmed using cutadapt
and cleansed from common
illumina and pcr-amplification adapters. Subsequently, reads
low-complexity reads (More than 75% A or T rich) were removed. Remaining
reads were assembled with Trinity
(default settings). Raw transcripts were blastn’ed using an e-value cutoff
of 0.0001 against the refseq_genomic
database to identify and to remove
bacterial contamination. Thereby we considered transcripts as bacterial if
they blasted with percentage identity greater than 90% and query
coverage greater than 90% against a bacterial reference sequence.
Finally, we a used read direction bias (if present in the data) to
correct transcript directionality.
The NCBI blast+ tool was used in blastx mode (-evalue 0.0001
-num_alignments 100 -num_threads 6 -outfmt ‘6 std ppos slen qlen qframe
sframe’
) to reveal sequence homologues in the refseq_protein database.
Just those blastx hits were retained that fulfilled the following filter
criterion: ((subject_coverage>0.2 & query_coverage>0.2) |
e_value< 1E-30) & (PC_similarity > 40)). To limit the number
of blastx results in PlanMine, we preferred hits in mouse, human, or
fruitfly over those in other species, and kept just the most significant
3 blastx hits for display in PlanMine. Our transcripts are also assigned
Gene Ontology (GO) terms based on the terms associated at the NCBI with
these RefSeq proteins identified through high Blast Homology to our
transcript.
To predict domain content assembled transcripts were translated in all 6 reading frames, split up at stop-codon positions, and the resulting chunks processed using InterProScan 5. In addition to InterProDomains we also incorporated treefam domains into our annotation workflow to provide gene-family annotations. These tools were run without any alteration to the default settings.
To annotate open reading frames in the assemblies, we used getorf
(EMBOSS tool) and refined it’s results to also include fragments without
a start or stop codon. Just open reading frames longer than 30 amino
acids were included into PlanMine.
To establish an inter- and intra-species homology relation between all assembled transcripts, we performed a reciprocal blastp (—evalue 0.001) analysis between the longest ORFs of each trinity graph component. This analysis was carried out separately for all pairwise assembly combinations. The resulting graph-component homology relations were extrapolated to all transcripts of the corresponding graph components.
Sequencing data was mapped back against each assembly (bowtie2 -k15) and the alignment results converted into read coverage tracks using genomeCoverageBed.
We filtered the raw assemblies to create a putative protein coding fraction (PCF) for import into the PlanMine database. Transcripts were kept if they contained an open reading frame longer than 75 amino acids, an annotated domain, or a blast hit.
Since most mentioned annotation steps are computationally demanding, processing was performed in a batch-wise manner using an HPC environment.
Assembly reports were created for each transcriptome in PlanMine, with different levels of reporting depending on whether we assembled from raw reads or just annotated a contributed assembly.
The first step in the workflow is to run the raw sequencing reads through FastQC to assess their quality.
We then do contamination filtering by performing a blastn with an e-value cutoff of 0.0001 against the refseq_genomic database to identify and to remove bacterial sequences. We consider transcripts as bacterial if they blast with percentage identity > 90% and a query_coverage > 90% against a bacterial reference sequence. A potential list of contaminants is generated for subsequent filtering.
Frequency plots of Blast high scoring pair query coverage are plotted to see how our transcripts compare to those they have blast homology with.
The frequency of domains predicted by InterProScan domain prediction tools is also reported, along with the number of transcripts that have at least one annotated domain.
Reading frames of predicted domains are also reported, which enables us to see if transcripts are mostly in the forward (frame 1-3) or reverse direction (4-6). Finally frame specificity is reported to see whether, if more than one domain is present, domains are in the same reading frame, to give an indication of the presence of frame-shifts
We also tested each assembly for the presence of Wnt and Frizzled gene family members: Genes were extracted if they contained respective domains, and added to a multiple sequence alignment containing genes of the same family. From these extended MSAs we inferred phylogenetic trees using clustalw (-BOOTSTRAP=1000 -KIMURA -BOOTLABELS=node). By doing we hope to reveal how assembled Wnt and Frizzeld transcripts relate to orthologues in to other (planarian) reference species.
To assess the open reading frames within our transcripts three plots are created. Firstly, the distribution of the longest ORF for each transcript is plotted, secondly the ratio of the longest ORF to transcript length is plotted, and finally the relationship between transcript length and the ratio of the longest ORF to transcript length is plotted.
After an assembly is complete we map all the reads used to create the assembly onto the final assembly to create coverage tracks and assess the median coverage of our transcripts.
In order to assess how complete our transcriptome is we blast all assembled transcriptomes against a set of [eukaryotic core genes]. We then order our transcripts in terms of their query coverage with this set and plot the query coverage of eukaryotic core genes against the percentile of our transcripts that has such a coverage.
In this section we report the fraction of transcripts filtered using our filtering criteria of having either annotated Blast Homology, an annotated domain or an open reading frame of more than 75 amino acids. Plots are also created showing the transcript length distribution of those filtered vs. those transcripts kept, and the distribution of the longest Open Reading Frame per transcript.
Venn diagrams showing the numbers of transcripts with various annotation types used for filtering are shown.
Finally the number of filtered transcripts (isoforms) for each Trinity graph component (gene) is shown.
Gene models for the Schmidtea mediterranea genome (Grohme et al., 2018 ) were predicted using the custom-designed pipeline cartooned above. The pipeline generally utilizes a combination of BRAKER (Hoff et al., 2015) and Augustus (Stank et al., 2003) and non-standard elements are described in the following sections. Expression evidence was provided by a sexual strain RNA-Seq dataset (SRX2700685) that was mapped to the genome sequence using STAR (Dobin et al., 2013) with standard parameters. The resulting read alignments and the underlying genome sequence were used as input for BRAKER. BRAKER carries out the initial training of Augustus by exploiting spliced read alignments. The end result of the training process is an Augustus model.
Since Augustus is primarily a coding gene prediction tool, its prediction of untranslated regions (UTRs) are relatively poor. To partially overcome this problem, we made use of Schmidtea mediterranea Ribo-seq datasets that have a higher coverage of 5’-UTR regions in comparison with standard RNAseq methods. Specifically, we submitted the Ribo-Seq dataset (SRX2700685) to our previously described de novo transcriptome assembly pipeline (Brandl et al., 2016) to generate transcripts with improved UTR representation. TransDecoder (Haas et al., 2013) was then used to identify coding regions, 5’ and 3’ UTR in the assembled transcripts. Complete transcripts having both UTRs (5 and 3') were flagged and their UTR sequences were extracted on basis of the TransDecoder annotations. The resulting data set was then used as additional training input for Augustus to improve UTR prediction.
Augustus encompasses workflows that use NGS-expression data to cue gene predictions to transcribed loci (referred to as “hints” in the Augustus user manual). The use of such hints improves the accuracy of Augustus gene predictions. We used the Ribo-Seq and RNA-Seq dataset SRX2700685 to generate hints. The Augustus parameters for gene prediction were: --alternatives-from-evidence=true, --alternatives-from-sampling=true and --sample=100.
Gene prediction resulted in a total number of 82007 genes and 132355 mRNAs. This data set is referred to as SMESG-all and available as browsable track in PlanMine and for download under the data sources tab.
The S. mediterranea genome is highly repetitive (Grohme et al., 2018 ) and a large fraction of the gene models in SMESG-all pick up repeat elements. In order to enrich for bona fide protein coding genes, a repeat filter was implemented. Briefly, Repeat Masker, with the custom repeat library used for the original genome annotation (Grohme et al., 2018 ), was run in sensitive mode in order to scan the predicted cDNAs for repeat content. All predicted cDNAs with >25% of exonic sequences identified as repetitive were eliminated. Note that the 25% threshold was established empirically using the “ground truth” cDNA set (see below) to minimize the elimination of protein coding genes. Repeat filtering reduced the prediction set to a total number of 30917 remaining gene models and 48281 transcripts. This data set is designated SMESG-repeat filtered and is likewise available for download under the data sources tab of PlanMine.
To finally deplete models with poor expression support (e.g., pseudogenes, false positive gene predictions) from the remaining data set, we filtered the remaining models for minimal RNAseq read coverage. To account for the possibility of differential gene activity between the two strains of S. mediterranea, we took into account RNAseq data sets from both the sexual (SRX2700685) and the asexual strain (extensive data set compiled for the assembly of the dd_Smed_v6 reference transcriptome; (Brandl et al., 2016)). Predictions with RPKM (sexual reads) or FPKM (asexual reads) equal or greater than 0.1 were selected. Note that the filtering criteria were empirically determined on basis of lowly expressed genes (Smed-ovo and Smed-wnt1) to prevent inadvertent elimination from the data set. Expression evidence filtering reduced the data set to 22192 genes and 31102 transcripts, which we refer to as the SMESG-high confidence data set. By analogy, we termed the 8725 gene models that became eliminated by this low confidence genes (not specifically available for download). The pie chart below summarizes the breakdown of the total gene predictions (SMESG-all) into said sub-fractions.
Just like de novo assembled transcriptomes, de novo predicted gene models are also subject to various potential artifacts. As first quality measure, we compared the representation BUSCO genes of (Waterhouse et al., 2017) genes between the SMESG-repeat filtered predictions (right) and two current S. mediterranea transcriptomes, dd_Smes_v1 (sexual; left) and dd_Smed_v6 (asexual; centre). As shown below, the SMESG predictions slightly outperform the two transcriptomes in terms of complete and missing genes, thus giving a first indication that the cDNA representation of the gene predictions is of a similar quality as that of the de novo assembled transcriptomes.
To obtain a more general measure of the accuracy of the gene predictions, we made use of the set of ~1500 high confidence transcripts that was previously used for the completeness evaluation of the S. mediterranea genome assembly (Grohme et al., 2018 ). As described in the supplemental material of the genome paper, these transcripts were selected on basis of high sequence homology and length coverage in six planarian transcriptomes and thus are likely significantly enriched for full lengths transcripts. The further elimination of transcripts with ambiguous genomic mapping coordinates (i.e., multimappers) and the few transcripts known to be missing from the current assembly left 1322 high confidence cDNAs that we used as “ground truth” for the evaluation of the gene predictions. Specifically, we used the ParsEval tool (Standage et al., 2012) to compare the mapped (GMAP; (Wu et al., 2005)) high confidence transcript set to the SMESG-repeat filtered data set. ~85% of the high confidence transcripts either perfectly matched a gene prediction or a predicted splice isoform, which implies base pair level identity between the experimentally determined and predicted cDNA sequence. 11.7% of the high confidence transcripts were categorized as mismatches. Manual inspection of the mismatch cases revealed a spectrum of underlying causes, from the deletion or addition of a few base pairs to the UTR to exon loss or exon truncations in the respective gene model. The significance of mismatch categorization is further hard to gauge quantitatively, given that even the high confidence transcripts likely harbor assembly mistakes and thus represent an imperfect “ground truth”. Only 10 HC transcripts (0.8 %) did not have a corresponding gene model due to a prediction failure. Overall, this analysis therefore provides a second line of evidence for the accuracy and completeness of our gene predictions. Taking into account the reference uncertainty, we estimate that > 90% of cDNAs are accurately represented by the gene models. Ongoing efforts target the currently missing genes and gene truncations, which will be made via SMESG version updates on PlanMine.
While the evaluation of transcript representation is important, a general evaluation of the gene predictions also requires an evaluation of the extent by which the predictions remedy known problems in de novo assembled transcriptomes.
The fusion of independent cDNAs into a single chimeric transcript (Haas et al. 2013) is one such error that is known to occur in some dd_Smed_v6 transcripts and we defined a blast strategy to identify a set of antisense chimeric transcripts (i.e., with the two reading frames in opposite orientation). Briefly, we sub-selected the longest Trinity assembly isoform of each transcript family and used the EMBOSS getorf tool (Rice et al. 2000) with parameters -find 0, -minsize 300, to select ORFs with at least 100 amino acids. The longest ORF out of nested ORF sets was identified and transcripts with two non overlaping ORF predictions on different strands were selected. Next a BLAST search (Camacho et al., 2009) against the NCBI "nr" database was performed and only individual ORFs with strong BLAST hits retained (e-value <= 1e-30 and query coverage >= 40). Finally, these ORFs were checked for coverage greater than 0 in RNA-Seq dataset (SRX2700685). These criteria identified a total of 150 and 119 putative chimeric transcripts in the dd_Smes_v1 and dd_Smed_v6 transcriptomes, respectively (note that this strategy was not designed to identify all, but a representative subset of antisense chimeras in the current transcriptomes). Transcripts and corresponding gene models were evaluated by manual inspection. Excluding 27 and 11 non-chimeras from dd_Smes_v1 and dd_Smed_v6, ~81% and ~74% of cases were found to be represented by separate and full length gene models. A further ~14% and ~24% were partially fixed, meaning that at least one of the two chimeric ORFs was represented by a reliable Augustus prediction. The graphs below provide a graphical representation of these results for the asexual (top) and sexual (bottom) transcriptomes. Overall, these data suggest that the gene predictions largely remedy the problem of assembly chimeras.
A further common error category in de novo assembled transcriptomes are partial or fragmented transcripts. Using a similar strategy as above, we first designed a search strategy to identify genes that are represented by more than 2 non-overlapping transcripts in dd_Smed_v6 and dd_Smes_v1. Briefly, we selected the longest Trinity assembly isoform for each transcript family and identified open reading frames (ORFs) greater than 100 aa in length (EMBOSS getorf; parameters: -find 0, -minsize 300; (Rice et al. 2000). Where multiple ORF predictions were nested within one another the longest ORF was selected and a blastp search against the Refseq_protein database performed. The results were filtered to retain good blast hits (e-value <1e-20, query_coverage>0.3) where high scoring pairs (HSPs) from two or more planarian ORFs match unique regions of the same RefSeq protein. Additionally results were filtered to exclude ORFs whose parent transcripts mapped to multiple different contigs in the S. mediterranea genome assembly (Grohme et al., 2018 ).
Altogether, this strategy identified 407 cases of transcript fragmentation, which again represents an underestimate of the true incidence of this error category in the transcriptomes. Manual inspection revealed that 337 fragmented transcripts were corrected by SMESG, while 70 events of fragmentation were not corrected Overall, these data indicate a lower degree of ORF fragmentation in the gene predictions as compared to the transcriptomes. This may be especially relevant in the case of lowly expressed genes, which are difficult to assemble.
In parallel with the above analysis, we also carried out a length distribution comparison of primary ORF length (i.e., the longest ORF of the sequence). As shown in the histogram below, both transcriptomes display a strong peak at short ORF length, which likely represents transcript fragments. This peak is not observed in either SMESG-repeat filtered or SMESG-high confidence, which instead peak at the likely mean length of planarian ORFs (shoulder in the transcriptome ORF length distributions). These data further support a reduced level of ORF fragments in the gene predictions as compared to the current transcriptomes.
A cohort of 185 highly conserved proteins (orthogroups) were identified using the orthoMCL tool (Li et al., 2003). Where an orthogroup contained multiple hits from a single species the best BLAST hit was selected based on evalue, bitscore and pident respectively. Protein sequences were aligned using Mafft (Katoh et al., 2013) (v7.402, default settings) and trimmed using trimal (v1.4.rev22) (Capella-Gutierrez et al., 2009) to extract well aligned regions only (-automated -nogaps). These sequences were then concatenated and used as input for the phylogeny inference tools PhyML (v3.3.20180214) (Guindon et al., 2010) (parameter settings: -d aa -q -b 100 -m LG -f e -v e -c 4 -a e -s BEST --rand_start --n_rand_starts 5 -o 'tlr'). The resulting phylogenetic tree is displayed on the PlanMine landing and BLAST pages.
How are they connected? Our transcripts are assigned Gene Ontology (GO) terms based on the terms associated with RefSeq genes from model organisms that show high Blast Homology to our transcript. We would like to get a list of GO terms that are associated with our transcript of interest. In this example I choose dd_Smed_v6_5822_0_1 for illustration purposes.
In order to begin our search we first must choose the data type we would like to use to start our Query. In our case this would be a transcript ID. So we select transcript and press the ‘select’ button.
Upon pressing select we are taken to another page that further allows refinement of our query. The whole data model hierarchy is shown on the left hand side with options to constrain by a field and show the field in the resulting table output. In this example we want to show and constrain (filter) by transcript ID so we click on the blue show button and then the red constrain button next to Id underneath the transcript section.
This brings up the following field where we can filter by our transcript id and click “Add to Query”:
We can now browse the hierarchy to locate the Gene Ontology information that we would like to display that is associated with our transcript ID. If we look down the list we can see a heading called “GO Annotation”, if we click the plus sign this part of the hierarchy expands allowing us to see the sub elements. We are interested in Ontology Terms and so can click on this plus too. We can then click show for Name and Namespace to give the following display:
The right hand side of the screen now shows us that we are constraining by transcript ID dd_Smed_v6_5822_0_1 and would like to display the Name and Namespace of associated GO terms. Underneath we also get a summary of what the resulting table will look like. We can drag columns around below to change column order.
Once we are happy, clicking show results gives the following output table:
The output table can then be saved, exported or further filtered as normal (see the manipulating tabular data section for more information).
Let’s say you want to retrieve all Wnt transcripts from Polycelis tenuis.
Again, note that this result will be entirely based on the parameter settings of our automated transcriptome annotation. Therefore, it is always a good idea to manually verify the search result.
We can used the pre-defined search from the transcript functions tab in the template box on the home page to search for differentially expressed transcripts for certain RNAi experiments.
Clicking on “Gene Knocked Down —> Differentially Expressed Transcripts” takes us to the search configuration page. In our case the defaults are OK as we want to look for transcripts affected by FoxD knockdown. It is important to note that only by selecting a gene used in an RNAi experiment can you currently get differentially expressed transcripts as this feature is still under development. We also select for only differentially expressed transcripts after FoxD knockdown by setting “Gene Expression Analysis Result > Differentially Expressed” to “= TRUE“. Finally we set “Gene Expression Analysis Result > Score” to “> 0“ as in our case we want up-regulated genes after FoxD knockdown.
When we click Show Results we get a results table showing transcripts that are up-regulated under FoxD knockdown under certain conditions. It is possible to further filter and sort this table but for our purposes we just want to get the unique transcript list.
In order to get this unique transcript list we can click on “create new list” and select “All 53 Transcripts” to then go to open a list creation box.
Here we can give our list a suitable name and description for later use. If you are logged into MyMine this list can be saved for later use.
Once the list has been saved you will see a green success bar under the header bar. Clicking on the list name takes us to the newly created list page.
This page shows basic transcript information and it is possible to filter, sort and add more columns, but for our purposes we are most interested in the enrichment analysis widgets below the list.
This enrichment analysis widgets show enriched Gene Ontology terms based on the terms associated with the transcript’s significant BLAST homology assignments. Protein domain enrichment is based on transcript annotated InterPro domains. It is possible to change the background population to any predefined list, to change the exact method for multiple testin correction, the p-value significance cut-off, and in the case of GO enrichment the exact ontology type (i.e. biological process, molecular function, cellular component)
We can now download these lists for later use if desired.
PlanMine is a community effort and we love to receive data for integration into PlanMine. Thank you VERY MUCH for your participation- the more data, the more rewarding the mining for all of us… As always, time is a precious resource and we therefore ask you for help in the submission/integration process. The text below specifies the submission details for 1) RNAseq dataset submissions; 2) Transcriptome assemblies.
Please note that we can only integrate datasets that have been deposited to the Short Reads Archive (SRA). SRA deposition is in any case a prerequisite for publication and we rely critically on their raw data storage and archiving services, which we cannot replicate at the required scale. To integrate new data sets, we need to know i) where the raw sequencing reads are being stored and ii) experimental details pertaining to the data set. The attached Excel form queries the relevant bits of information. Please e-mail the completed form as attachment to planmine@mpi-cbg.de. Use the same e-mail address if you encounter any problems in the process.
The submission form background:
For integrating your transcriptome assembly into PlanMine, we need i) the actual assembly in form of a fasta file; ii) background information on the raw data and assembly techniques that were used for generating the assembly. For i), please drop an e-mail to planmine@mpi-cbg.de to arrange transfer possibilities (ftp server, dropbox or similar). For ii), please complete the attached questionnaire. Please enter as much information as possible as this will greatly increase the utility of the transcriptome.