Today, a post about new developments from the world of DNA barcoding informatics. I selected three publications from the last few months that each provide a new package worth testing by the community. Without further ado, here is my little collection of new bioinformatic releases.
The first idea starts with the notion that a modern DNA barcoding approach should incorporate the multispecies coalescent. The multispecies coalescent model was developed as a framework to infer species phylogenies from multilocus genetic data collected from multiple individuals. It assumes that speciation occurs at a specific point in time, after which the two new species evolve in total isolation. In reality, however, speciation may occur over an extended period of time, during which the two sister lineages likely remain in some sort of contact. Inferring phylogenies for multiple species under those conditions is actually very difficult and requires a fair amount of computation time.
Using the approach with DNA barcode data is a little simpler, as one element of complexity has been removed by using only one gene region or, as in this publication, just two, because the authors make the following bold statement: "recent developments make a barcoding approach that utilizes a single locus outdated". I beg to differ, as I don't see the point of sequencing a plethora of loci when one does the trick already, but that is a topic for another post sometime. The nice thing about this approach is that it utilizes already existing software and algorithms. Everyone familiar with this approach should find it easy to follow their recipe, which uses *BEAST to produce a guide tree for the subsequent BPP (Yang and Rannala 2010) analysis.
This coalescent-based *BEAST/BPP approach was used to identify species boundaries. The colleagues used a test set of Sarcophaga species to compare a distance-based approach with their new method: "We found that, of the 31 species of Sarcophaga examined..., 27 could be reliably distinguished by barcoding when a 4% sequence divergence threshold was applied. The four problematic taxa were S. megafilosia, S. meiofilosia, S. crassipalpis and S. ruficornis. S. megafilosia and S. meiofilosia had an interspecific divergence of 2.81%, while S. crassipalpis and S. ruficornis had an interspecific divergence of 3.75%. The success rate of barcoding for this set of taxa is thus 87%, while the *BEAST/BPP approach had a success rate of 100%."
The only question I have is why divergence values of 2.81% and 3.75% were considered problematic in the first place. That can only happen if a fixed value is used to define species boundaries. Who does that?
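To make the arithmetic behind that comparison concrete, here is a minimal Python sketch. It uses only the numbers quoted above (31 species, a 4% cut-off, and the two divergence values); the function name `unresolved_species` and the data layout are my own assumptions for illustration, not anything taken from the publication.

```python
# Interspecific divergences (%) quoted above for the two problematic pairs.
problem_pairs = {
    ("S. megafilosia", "S. meiofilosia"): 2.81,
    ("S. crassipalpis", "S. ruficornis"): 3.75,
}
THRESHOLD = 4.0      # fixed divergence cut-off (%) from the quoted comparison
TOTAL_SPECIES = 31   # number of Sarcophaga species in the test set


def unresolved_species(pairs, threshold):
    """Hypothetical helper: species in a pair that diverges by less than the threshold."""
    flagged = set()
    for (first, second), divergence in pairs.items():
        if divergence < threshold:
            flagged.update((first, second))
    return flagged


flagged = unresolved_species(problem_pairs, THRESHOLD)
success_rate = 100 * (TOTAL_SPECIES - len(flagged)) / TOTAL_SPECIES
print(sorted(flagged))
print(f"{success_rate:.0f}%")
```

With the fixed 4% cut-off all four taxa are flagged, which gives the 27 out of 31 (87%) from the quote. Calling `unresolved_species(problem_pairs, 2.5)` flags none of them, which shows how much the result hinges on the chosen value.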
ExcaliBAR is a small routine to facilitate one important initial step in DNA Barcoding analyses, namely the determination of the barcoding gap between pairwise genetic distances among and within species, based on original distance matrices computed by MEGA software. In addition, the software is able to rename sequences downloaded via the standard user interfaces of public databases such as GenBank, without the need of developing and applying specific scripts for this purpose.
This is an interesting little tool, although I have to admit that, aside from the very useful renaming of sequences, which makes the resulting file compatible with other software, I don't see the full advantage. From what I understood reading the paper, the routine takes a MEGA output file containing a pairwise distance matrix. ExcaliBAR then calculates the intra- and interspecific pairwise distances, which can be exported, e.g. into Excel, to determine a threshold above which sequences are likely to represent different species. The authors claim that the program actually performs better than other software such as ABGD or SpideR. They even have the guts to state that, similar to the other programs, the ‘Barcode Gap Analysis’ option on BOLD was not devised to handle large datasets. I beg to differ, as BOLD is probably still the best option available to deal with large datasets, and it has been criticized for using distance-based methods to accomplish this. ExcaliBAR still needs about 30 min to process a matrix generated from 5000 DNA barcodes. Not bad, but is it really better than the others?
One criticism raised in the ExcaliBAR publication is that some programs use R. R is a free programming language and software environment for statistical computing and graphics, and yes, there is a bit of a learning curve involved before one masters it.
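For readers who want to see what such a barcoding gap calculation boils down to, here is a minimal Python sketch. It is not ExcaliBAR's implementation and does not parse MEGA files; it assumes the distance matrix is already loaded as a list of lists with one species label per row, and the toy values and function names (`split_distances`, `barcoding_gap`) are invented for illustration.

```python
from itertools import combinations


def split_distances(species, matrix):
    """Split all pairwise distances into intra- and interspecific lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(species)), 2):
        if species[i] == species[j]:
            intra.append(matrix[i][j])
        else:
            inter.append(matrix[i][j])
    return intra, inter


def barcoding_gap(intra, inter):
    """Smallest interspecific minus largest intraspecific distance."""
    return min(inter) - max(intra)


# Toy data (made up): two species with two sequences each.
species = ["A", "A", "B", "B"]
matrix = [
    [0.000, 0.004, 0.052, 0.055],
    [0.004, 0.000, 0.050, 0.053],
    [0.052, 0.050, 0.000, 0.006],
    [0.055, 0.053, 0.006, 0.000],
]

intra, inter = split_distances(species, matrix)
gap = barcoding_gap(intra, inter)
print(f"intra: {intra}")
print(f"inter: {inter}")
print(f"gap:   {gap:.3f}")
```

A positive gap means every interspecific distance exceeds every intraspecific one, so any threshold inside the gap separates the species in this toy set. A negative value means the two distributions overlap and no single threshold will do.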
Adhoc is a new method to deal with incomplete reference libraries of DNA barcodes. It is based on ad hoc distance thresholds that are calculated for each library, considering the estimated probability of relative identification errors. By using each sequence of a reference library as a query against all other reference sequences, the program can calculate the relative identification error (RE) of the best close match method. Prior to that, Adhoc generates some basic descriptive statistics of the imported dataset, providing two tables containing species names, full sequence identifiers, and numbers of sequences and haplotypes for each species. It also returns the length of each reference sequence, calculates all pairwise distances and separates intra- and interspecific pairwise comparisons. In their publication the authors also provide a very important disclaimer:
This method has been developed for specimen identification. It is intended to optimise the identification success rate by adapting the distance threshold according to a RE estimated from a particular reference library. Hence, using this method for species delimitation requires a careful interpretation of the output.
I think that disclaimer should be found on every bioinformatic tool for DNA Barcoding.
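As a footnote to the Adhoc description, the leave-one-out idea can be sketched in a few lines of Python. This is not the Adhoc code (which is an R package), and two things here are my own assumptions: RE is taken as wrong identifications divided by all identifications made within the threshold, and ties between species are counted as wrong. Please check the publication for the exact definitions. The toy matrix and the function names are invented for illustration.

```python
def best_close_match(query, species, matrix, threshold):
    """Identify one reference sequence against all others (leave-one-out)."""
    others = [j for j in range(len(species)) if j != query]
    best = min(matrix[query][j] for j in others)
    if best > threshold:
        return "none"  # no match within the threshold, so no identification
    best_species = {species[j] for j in others if matrix[query][j] == best}
    # Assumption: an ambiguous best match (several species) counts as wrong.
    return "correct" if best_species == {species[query]} else "wrong"


def relative_error(species, matrix, threshold):
    """Assumed definition: wrong / (correct + wrong) over all queries."""
    outcomes = [
        best_close_match(q, species, matrix, threshold)
        for q in range(len(species))
    ]
    correct = outcomes.count("correct")
    wrong = outcomes.count("wrong")
    identified = correct + wrong
    return wrong / identified if identified else 0.0


# Toy library (made up): species C has a single sequence, so when it is
# used as a query its own species is missing from the reference set.
species = ["A", "A", "B", "B", "C"]
matrix = [
    [0.000, 0.004, 0.052, 0.055, 0.060],
    [0.004, 0.000, 0.050, 0.053, 0.061],
    [0.052, 0.050, 0.000, 0.006, 0.020],
    [0.055, 0.053, 0.006, 0.000, 0.022],
    [0.060, 0.061, 0.020, 0.022, 0.000],
]

for threshold in (0.03, 0.01):
    print(threshold, relative_error(species, matrix, threshold))
```

In this toy library the singleton C is wrongly assigned to B at a threshold of 0.03, while at 0.01 it simply gets no identification. That is the trade-off an ad hoc threshold is meant to tune for each library, and it concerns specimen identification, not species delimitation.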