It is time for sin number four in the series on the Collins and Cruickshank paper.
Oh yeah, that one. I just think the focus lies on the wrong thing here. The issue is the use of trees, not the method used to calculate them. From day one of DNA Barcoding, trees have been used to visualize that datasets can be separated into coherent groups mostly corresponding to species. In hindsight that might be considered a mistake, as people often associate the depiction of a tree with phylogenetics. However, in most cases there is no assumed direct connection between sequence similarity and evolutionary relationship, nor are NJ-trees in many DNA Barcoding studies intended to infer phylogenetic relationships. The way a barcoder should look at such trees is to focus on the terminals (and their respective homogeneity) and to ignore the remainder of the topology. Let's also not forget that we are looking at single, mostly short DNA fragments that exhibit only limited phylogenetic signal. Again, limitations attributed to the choice of neighbour-joining as a method are actually caused by an entirely different problem.
Collins and Cruickshank admit that it is important to note at this point that problems with NJ trees are not resolved by using any other tree inference method such as maximum likelihood or parsimony. The problem is with relying on tree topologies and monophyly (in a topological rather than cladistic sense) as an identification criterion. In situations of incomplete lineage sorting and species-level paraphyly, tree-based identification methods will result in ambiguous or incorrect identifications.
We had a seminar in our department and discussed the paper among faculty, postdocs, and students. One thing that came up during our discussions on sin #4 was that we need to be consistent. If we use trees in our studies as a visualization tool, we should clearly state that, and we need to avoid discussing findings using phylogenetic terminology such as monophyletic, paraphyletic etc. The tree is a cluster diagram at best and should be treated accordingly.
That leaves us with an important question. What alternatives to trees do we have to visualize DNA Barcoding data at large? This becomes even more important in light of the ever-growing datasets at hand. Trees are seriously limited when it comes to displaying thousands of terminals. Collins and Cruickshank provide some suggestions for tree-free methods.
People in the DNA barcoding community have been thinking about this for a while already. In 2006 Mark Stoeckle provided a few suggestions for visualizing large datasets in his blog, and he continued to provide new ideas that might overcome the limitations of trees, especially when it comes to large amounts of data. Probably the most promising approach was the use of a heat map which, due to its unique appearance, was named Klee-diagram. The method is based on the transformation of DNA sequence information into binary vectors.
A heat map of the correlations between these vectors provides a visual display of the similarity among sequences. What I like most about the heat maps, false-color maps, or Klee-diagrams is that they are powerful visualizations of large data sets that do not require a lot of space. They are scalable and it doesn't really matter what the input is: vectorized sequences or distance values both work. In 2011 I gave a talk on the potential of this method.
Comparison of a tree and a heat map depicting the same data set (470 COI sequences of fish species)
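To make the idea of binary vectors and their correlations a bit more concrete, here is a minimal sketch in plain Python. It is not the published Klee-diagram procedure, just my simplified reading of it: each aligned site becomes four 0/1 entries (one per base), and the heat map values are plain Pearson correlations between those vectors. The function names, the four-letter alphabet, and the toy sequences are all assumptions made up for illustration.

```python
from math import sqrt

ALPHABET = "ACGT"  # assumption: gaps and ambiguity codes become all-zero sites


def to_indicator_vector(seq):
    """Turn an aligned sequence into a binary vector, 4 entries per site."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec


def pearson(u, v):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # no usable signal, e.g. a sequence made only of gaps
    return cov / sqrt(var_u * var_v)


def correlation_matrix(seqs):
    """Square matrix of pairwise correlations, i.e. the heat map values."""
    vectors = [to_indicator_vector(s) for s in seqs]
    return [[pearson(u, v) for v in vectors] for u in vectors]


def render(matrix, shades=" .:*#"):
    """Crude text heat map: darker characters mean higher correlation."""
    lines = []
    for row in matrix:
        cells = []
        for value in row:
            clipped = min(max(value, 0.0), 1.0)
            cells.append(shades[round(clipped * (len(shades) - 1))])
        lines.append("".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy, made-up sequences (not real COI data): two groups of two.
    toy = ["ACGTACGT", "ACGTACGA", "TTGCATGC", "TTGCATGG"]
    print(render(correlation_matrix(toy)))
```

With real data you would hand the matrix to a proper plotting tool instead of printing characters, but the principle stays the same: sequences of the same species show up as bright blocks along the diagonal, and the picture does not grow any wider than the matrix itself.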
I've stated it above - this is not why so many papers show NJ-trees for DNA Barcodes. Mostly the tree is meant to show the cohesiveness of species groups, and a tree is certainly a very intuitive way of doing that. In addition, NJ is computationally the fastest way. All other methods proposed at this point (at least the tree-building ones) are far too slow to provide an alternative.
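If cohesiveness of species groups is all we want to show, it can also be checked without drawing any tree. The following is only a rough sketch of one way to do that, not something taken from the paper: for every species it compares the largest distance within the group to the smallest distance to any other group, using simple p-distances on aligned sequences. The function names and the toy records are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def species_cohesion(records):
    """Map each species to (max within-species, min between-species distance).

    records is a list of (species_name, aligned_sequence) pairs.
    The second value is None if there is no other species to compare to.
    """
    summary = {}
    for species in sorted({name for name, _ in records}):
        own = [seq for name, seq in records if name == species]
        others = [seq for name, seq in records if name != species]
        within = [
            p_distance(a, b)
            for i, a in enumerate(own)
            for b in own[i + 1:]
        ]
        between = [p_distance(a, b) for a in own for b in others]
        summary[species] = (
            max(within, default=0.0),
            min(between, default=None),
        )
    return summary


if __name__ == "__main__":
    # Toy, made-up records (not real barcodes).
    toy = [
        ("species_A", "AAAA"),
        ("species_A", "AAAT"),
        ("species_B", "TTTT"),
        ("species_B", "TTTT"),
    ]
    for name, (within, between) in species_cohesion(toy).items():
        print(name, within, between)
```

A group is cohesive in this simple sense when the first number is clearly smaller than the second. It says nothing about relationships among the groups, which is exactly the point.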
So, what's the take-home message? Neighbour joining is for sure not a perfect method, although everything depends on what our intention is. When it comes to writing up the results section of our manuscript, we should ask ourselves what we want to present, and we need to state that clearly. If it is supposed to be a visualization of our dataset, there are alternatives to trees, especially if we have a large dataset. If we intend to do phylogenetics, we need to think about alternatives to NJ. In such a case I usually don't bother with distance or character-based methods and use other, more comprehensive approaches (e.g. Maximum Likelihood). There is just one problem - one marker such as COI is usually not enough to do phylogenetics, which means back to the drawing board.