In a new study, researchers from Honolulu assess substitution model adequacy for describing genetic variation and for estimating species richness from barcoding data. One of their main motivations is the almost ubiquitous use of the Kimura 2-parameter (K2P) model. Earlier studies have already shown that this is perhaps a poor choice because it doesn't always fit well at the species level and has provided unreliable estimates of the number of operational taxonomic units (OTUs).
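For readers who have not met the K2P model in code, here is a minimal sketch of the standard K2P distance between two aligned sequences, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The function name `k2p_distance` and the choice to skip gaps and ambiguous bases are assumptions made for illustration; this is not code from the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: sites with gaps or ambiguity codes are simply skipped.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitions (P)
    q = transversions / compared  # proportion of transversions (Q)
    inner = (1 - 2 * p - q) * math.sqrt(max(1 - 2 * q, 0.0))
    if inner <= 0:
        # Sequences too divergent for the correction to be defined.
        return math.inf
    return -0.5 * math.log(inner)
```

The model has only two rate parameters (transitions and transversions) and assumes equal base frequencies, which is why richer models such as HKY or GTR can fit barcode data better.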
The colleagues first assembled many (more than 2000) empirical data sets. For each of those, they selected a ‘best fit’ model of molecular evolution and performed DNA barcoding analyses under both the chosen model and the K2P model to estimate the number of OTUs, clustering the sequences with either ABGD or hCluster. Finally, they conducted Bayesian phylogenetic inference and posterior predictive simulation to assess the fit of each substitution model to each data set.
Not surprisingly, the more complex models seem to fit better. The most frequently selected model for the barcode data sets was GTR + Γ, followed by HKY + Γ. The model adequacy assessments showed that the K2P model was within the 95% highest posterior density for only 3% of all data sets.
But the more interesting finding is that, depending on the method and threshold employed, the total number of OTUs varied considerably (4%–31%), meaning that model choice has a substantial impact on the number of operational taxonomic units identified.
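To see why model choice can change an OTU count at all, here is a toy sketch. It uses plain single-linkage clustering at a fixed distance threshold as a stand-in for the clustering step; it is not ABGD or hCluster, and the function name `count_otus` is hypothetical. The two distance matrices are invented numbers for four imaginary sequences, with the second matrix mimicking a model that corrects more strongly for multiple substitutions.

```python
def count_otus(dist, threshold):
    """Count single-linkage clusters, linking pairs closer than threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})


# Invented toy distances (not data from the study).
dist_model_a = [
    [0.000, 0.010, 0.028, 0.080],
    [0.010, 0.000, 0.029, 0.082],
    [0.028, 0.029, 0.000, 0.081],
    [0.080, 0.082, 0.081, 0.000],
]
dist_model_b = [
    [0.000, 0.011, 0.033, 0.095],
    [0.011, 0.000, 0.034, 0.097],
    [0.033, 0.034, 0.000, 0.096],
    [0.095, 0.097, 0.096, 0.000],
]

for threshold in (0.02, 0.03):
    print(threshold,
          count_otus(dist_model_a, threshold),
          count_otus(dist_model_b, threshold))
```

In this toy case both matrices give three OTUs at a threshold of 0.02, but at 0.03 the first gives two and the second still gives three, because slightly larger corrected distances push one pair across the cutoff.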
Take-home message:
We demonstrate that the widely followed practice of a priori assuming the Kimura 2-parameter model for DNA barcoding is statistically unjustified and should be avoided. Using both data-based and inference-based test statistics, we detect variation in model performance across taxonomic groups, clustering algorithms, genetic divergence thresholds and substitution models. Taken together, these results illustrate the importance of considering both model selection and model adequacy in studies quantifying biodiversity.