Tuesday, May 10, 2016

Bad barcoding

A tree that shouldn't be
To find new material for this blog and to stay on top of all things DNA barcoding, I look at new publications at least once a week. Sometimes I come across papers that provoke reactions such as "Are you kiddin' me?" or other, much stronger expressions that shouldn't be published in a blog like this one. Today was such a day.

Let me start by saying that I was disappointed to find such an article in a respected journal such as Mitochondrial DNA. I have published there myself and I can't understand how this article could make it through editorial and peer review.

So, let me introduce this one to you: 

This paper discusses an apparent confusion of two different COI fragments as if this were actually a real problem. Let me give you an example: "For decapod crustaceans, there are two major COI barcode regions (namely Folmer and Palumbi), which are non-homologous with limited overlaps." NO, there are not two major barcode regions used for decapods, there is just one - the Folmer one. The other is indeed a different region of the COI gene. That's not a big find. In fact, this is known to everyone who does either DNA barcoding or molecular evolutionary studies. Such textbook knowledge does not warrant a full-blown scientific study, but here it is. The authors did not stop there, and that's what makes the rest of the paper so bad, even if the intentions were perhaps good.

Knowing that there are two different regions of the COI gene, the authors went ahead 'validating' this by sequencing both from a decapod species and attempting some sort of phylogenetic analysis. "But, unexpected results were drawn out as the COI gene tree exhibited reciprocal monophyly in its array, differentiating 'Folmer' and 'Palumbi' regions as two different entities." Really? A surprise? The only surprise to me is that they managed to align both of them. Well, that is not a surprise but rather a mystery, as you can't properly align these two different parts. The overlap is minimal (<30 bp). And no matter what you throw at such a fundamentally flawed dataset (Maximum Likelihood, Neighbor Joining, Minimum Evolution), you will indeed get two very 'divergent' lineages in your resulting tree. The authors even went on to test the utility of both markers by generating trees from alignments that contained each set of sequences. You simply cannot do that.
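To make the overlap argument concrete, here is a minimal Python sketch of the kind of sanity check that should come before any alignment: work out how many gene positions two fragments actually share. The coordinates below are placeholders I picked so that the overlap falls under 30 bp; they are not the real primer positions. The `Fragment` type, the function names and the `min_shared` threshold are likewise my own assumptions for illustration, not an established cut-off.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    name: str
    start: int  # 1-based, inclusive position on the full COI gene
    end: int    # 1-based, inclusive position on the full COI gene


def shared_positions(a: Fragment, b: Fragment) -> int:
    """Return how many gene positions are covered by both fragments."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def comparable(a: Fragment, b: Fragment, min_shared: int = 200) -> bool:
    """Hypothetical rule of thumb: only analyse two fragments together
    if they share at least `min_shared` homologous positions."""
    return shared_positions(a, b) >= min_shared


# Placeholder coordinates, NOT real primer positions.
folmer = Fragment("Folmer", 1, 658)
palumbi = Fragment("Palumbi", 640, 1200)

print(shared_positions(folmer, palumbi))  # 19
print(comparable(folmer, palumbi))        # False
```

With only a handful of shared positions, nearly every column in a joint "alignment" pairs non-homologous bases, so any tree-building method will separate the two sets of sequences regardless of the biology.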

How can such a study be published? First, the authors have not bothered to read the literature that clearly differentiates the markers and, with respect to the DNA barcode for animals, clearly defines the region to be used. I have stated this repeatedly - you cannot call everything a DNA barcode. That is the opposite of standardization. Even worse, this study is a textbook example of comparing apples and oranges in the DNA analysis world. I still don't understand how they could take two different regions and attempt to align and analyse them. To show that it doesn't work? Well, then they shouldn't show a tree at all, as any proper analysis fails much earlier in the process. However, the worst part of this story is the fact that it got published, meaning it slipped through the entire peer review process. A reviewer should at least have pointed out the heavily flawed methods, even if some of the conclusions are perhaps correct.

To be fair, on one point I have to agree with the authors of this study - there are some instances where sequences were uploaded to public databases such as GenBank and these are clearly not standard DNA barcodes but something else. They are not labelled properly and could cause confusion if somebody does a name-based search and retrieves a mixture of different COI regions. However, that is not a problem of DNA barcoding as a method but rather one of people who do not understand how it works.

1 comment:

  1. I agree with you, it is hard to understand how this can be published. As to how they managed to align the sequences, I guess they didn't. Did you see the pairwise distances between the two regions? They are more than 1, i.e. there are more differences than bases when using a K2P model.
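
For readers wondering how a distance can exceed 1: the Kimura 2-parameter (K2P) model estimates substitutions per site as d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitions and transversions. The correction grows without bound as the sequences approach random similarity. Below is a minimal Python sketch of that formula; the function name and the toy sequences are my own and are not taken from the paper.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences.

    Columns containing anything other than A, C, G or T are skipped.
    Returns math.inf when the sequences are too divergent for the
    correction to be defined (saturation).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = ts = tv = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        n += 1
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p, q = ts / n, tv / n
    a = 1 - 2 * p - q
    b = 1 - 2 * q
    if a <= 0 or b <= 0:
        return math.inf
    return -0.5 * math.log(a * math.sqrt(b))


# Toy example: 10 sites, 2 transitions and 4 transversions (60% different).
d = k2p_distance("AAAAAAAAAA", "GGCCCCAAAA")
print(round(d, 3))  # 1.207
```

In this toy case 60% of the sites differ, yet the corrected distance is already above one substitution per site. Values like that between supposedly homologous sequences are a strong hint that the columns being compared are not homologous at all.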