Free keywords:
reciprocal best hit; conditional reciprocal best hit; codon alignment; Ka/Ks; dN/dS; tandem duplicated genes; synteny
Abstract:
CRBHits is a coding sequence (CDS) analysis pipeline in R (R Core Team, 2019). It reimplements the Conditional Reciprocal Best Hit (CRBH) algorithm crb-blast and covers all necessary steps, from sequence similarity searches and codon alignments to Ka/Ks calculations and synteny. The new R package targets ecologists and population and evolutionary biologists working in the field of comparative genomics.

The Reciprocal Best Hit (RBH) approach is commonly used in bioinformatics to show that two sequences evolved from a common ancestral gene. In other words, RBH tries to find orthologous protein sequences within and between species. These orthologous sequences can be further analysed to evaluate protein family evolution, infer phylogenetic trees and annotate protein function (Altenhoff et al., 2019). The initial sequence search step is classically performed with the Basic Local Alignment Search Tool (blast) (Altschul et al., 1990) and, due to evolutionary constraints, in most cases protein coding sequences are compared between two species. Downstream analyses use the resulting RBHs to cluster sequence pairs and build so-called orthologous groups, as done by e.g. OrthoFinder (Emms & Kelly, 2015) and other tools.

The CRBH algorithm was introduced by Aubry et al. (2014) and builds upon the traditional RBH approach to find additional orthologous sequences between two sets of sequences. As described earlier (Aubry et al., 2014; Scott, 2017), CRBH uses the sequence search results to fit an expect value (E-value) cutoff given each RBH to subsequently add sequence pairs to the list of bona-fide orthologs given their alignment length.

Unfortunately, as mentioned by Scott (2017), the original implementation of CRBH (crb-blast) lacks improved blast-like search algorithms to speed up the analysis.
As a consequence, Scott (2017) ported CRBH to Python as shmlast, although shmlast cannot deal with IUPAC nucleotide code so far.

CRBHits constitutes a new R package, which builds upon previous implementations and ports CRBH into the R environment, which is popular among biologists. CRBHits improves CRBH by additional implemented filter steps (Rost, 1999) and the possibility to apply custom filters prior to E-value fitting. Further, the resulting CRBH pairs can be evaluated for the presence of tandem duplicated genes, gene order based syntenic groups and evolutionary rates.
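To make the RBH and CRBH ideas described above concrete, the following is a minimal Python sketch and not the CRBHits, crb-blast or shmlast implementation. The tuple layout of a hit, the function names, the `window` parameter and the toy data are all assumptions made for illustration. In particular, the conditional step uses a simplified stand-in for the E-value fitting: the cutoff at a given alignment length is taken as the mean -log10(E-value) of RBHs with a similar alignment length.

```python
from math import log10

# Assumed toy hit format: (query_id, target_id, evalue, alignment_length).

def best_hits(hits):
    """Keep the lowest E-value hit per query (first one wins on ties)."""
    best = {}
    for query, target, evalue, length in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue, length)
    return best

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    rbh = []
    for query, (target, evalue, length) in best_ab.items():
        if target in best_ba and best_ba[target][0] == query:
            rbh.append((query, target, evalue, length))
    return sorted(rbh)

def neg_log10(evalue, floor=1e-200):
    """-log10(E-value), with a floor so that an E-value of 0 stays finite."""
    return -log10(max(evalue, floor))

def conditional_hits(hits, rbh, window=0.1):
    """Simplified conditional step (an assumption, not the published fit):
    accept a non-RBH hit if its -log10(E-value) is at least the mean
    -log10(E-value) of RBHs whose alignment length lies within
    +/- window * length of the hit's own alignment length."""
    rbh_pairs = {(q, t) for q, t, _, _ in rbh}
    extra = []
    for query, target, evalue, length in hits:
        if (query, target) in rbh_pairs:
            continue
        nearby = [neg_log10(e) for _, _, e, l in rbh
                  if abs(l - length) <= window * length]
        if nearby and neg_log10(evalue) >= sum(nearby) / len(nearby):
            extra.append((query, target, evalue, length))
    return sorted(extra)

# Hypothetical search results between two small sequence sets A and B.
a_vs_b = [
    ("a1", "b1", 1e-50, 300),
    ("a1", "b2", 1e-10, 280),
    ("a2", "b2", 1e-40, 290),
    ("a3", "b1", 1e-48, 310),
    ("a3", "b3", 1e-5, 100),
]
b_vs_a = [
    ("b1", "a1", 1e-50, 300),
    ("b2", "a2", 1e-40, 290),
    ("b2", "a1", 1e-10, 280),
    ("b3", "a3", 1e-5, 100),
]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
extra = conditional_hits(a_vs_b, rbh)
print("RBH pairs:", [(q, t) for q, t, _, _ in rbh])
print("Additional conditional pairs:", [(q, t) for q, t, _, _ in extra])
```

In this toy data, a3 and b1 are not reciprocal best hits because b1 prefers a1, yet the a3-b1 hit is about as strong as the RBHs of similar alignment length, so the conditional step adds it. The weak a1-b2 hit and the short a3-b3 hit, which has no RBH of comparable length to calibrate against, are not added.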