Thursday, January 14, 2016

Guest post: Sample size estimation for DNA barcoding: Are current sampling levels enough?

Today's guest post is by Jarrett Philips. He writes about a project from his Master's studies in Bioinformatics, which he also presented as a poster at the 6th International Barcode of Life Conference last August. Enjoy, and many thanks to Jarrett for his contribution.

The ability of DNA barcodes to detect meaningful genetic variation within and between species is strongly influenced by the scale of specimen sampling. Unfortunately, global barcoding efforts have been only partially successful in this regard, because most studies forgo deep sampling within taxa in favour of maximizing the number of taxa sampled.

A practical sample size of five individuals per species is common in barcoding studies, but such a strategy is by no means sufficient to capture the genetic variation within a species. This has led to sampling schemes in which many more specimens per species are collected.

To do this, they developed a simple quantitative model that predicts the total sample size for a species from three quantities: the number of specimens observed, the number of COI haplotypes observed, and an estimate of total haplotype diversity. In creating their model, one very important assumption was made: that haplotypes occur at equal frequency within species populations. Such an assumption is not biologically realistic, since species abundances are often skewed geographically.
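
The post does not spell out the model's formula, so the following is only a minimal sketch of one form that is consistent with the description. If every haplotype is equally frequent, each new haplotype should cost about the same number of specimens, so the observed specimens-per-haplotype ratio can be scaled up to the estimated total number of haplotypes. The function name, the proportional form N* = N × H* / H, and the example numbers are all assumptions made for illustration.

```python
def estimate_total_sample_size(n_observed, h_observed, h_total):
    """Proportional extrapolation N* = N * H* / H.

    Assumed form (not quoted from the post): with equal haplotype
    frequencies, specimens per haplotype stays constant as sampling grows.
    """
    if n_observed <= 0 or h_observed <= 0:
        raise ValueError("observed specimen and haplotype counts must be positive")
    if h_total < h_observed:
        raise ValueError("estimated total haplotypes cannot be below the observed count")
    return n_observed * h_total / h_observed


# Hypothetical species: 100 specimens yielded 10 haplotypes,
# and total haplotype diversity is estimated at 25.
n_star = estimate_total_sample_size(100, 10, 25)
print(n_star)  # 250.0
```

Because rare haplotypes take disproportionately long to find, a proportional estimate of this kind would be expected to err on the low side whenever frequencies are unequal.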

The authors found that widely varying sample sizes (from 150 to 5400 individuals per species) are likely needed to uncover all haplotype diversity across 18 selected species comprising freshwater, marine and migratory ray-finned fishes (Chordata: Actinopterygii). This is a far cry from the sampling intensities currently employed in many barcoding initiatives. However, such numbers may not be practical, and further investigation will be required to establish how much sampling is needed to gauge existing genetic diversity in this group and others.

Figure: Haplotype accumulation curves and frequency histograms for four species, selected to show a range of sample sizes and haplotype diversity. Curves that saturate rapidly indicate that much of the intraspecific haplotype diversity may already have been sampled. Curves showing little to no asymptotic behavior suggest further sampling is necessary.
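
For readers who have not built one, a haplotype accumulation curve can be sketched in a few lines. The example below is a minimal illustration, not the authors' code: it shuffles the sampling order of the specimens many times and averages the number of distinct haplotypes seen after each specimen. The function name, the number of permutations and the specimen labels are hypothetical.

```python
import random


def accumulation_curve(haplotype_labels, permutations=1000, seed=0):
    """Mean number of distinct haplotypes seen after 1..N specimens,
    averaged over random orderings of the same specimens."""
    rng = random.Random(seed)
    labels = list(haplotype_labels)
    totals = [0] * len(labels)
    for _ in range(permutations):
        rng.shuffle(labels)
        seen = set()
        for i, haplotype in enumerate(labels):
            seen.add(haplotype)
            totals[i] += len(seen)
    return [total / permutations for total in totals]


# Hypothetical species: 20 specimens carrying 4 haplotypes, one of them dominant.
specimens = ["A"] * 12 + ["B"] * 5 + ["C"] * 2 + ["D"]
curve = accumulation_curve(specimens)

# Print the mean haplotype count at a few sampling depths.
for depth in (1, 5, 10, 20):
    print(depth, round(curve[depth - 1], 2))
```

A curve that is still climbing at the last specimen, as it tends to when singletons such as "D" are present, is the pattern the caption describes as a sign that further sampling is necessary.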

The final paragraph of their study is particularly motivating:

We recognize that estimates of N* calculated from our model likely represent underestimates of the true number of individuals of a given species which should be sampled. Many more specimens should therefore be sampled in order to ensure a sufficient number of haplotypes have been recovered. Equal haplotype frequencies are rarely observed in natural populations, and we suggest the development of more sophisticated models should explore the use of data simulations to evolve models that explicitly account for variation in species haplotype frequencies.
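
The kind of simulation suggested in that paragraph can be illustrated with a small sketch. It is not the authors' method: it simply draws specimens at random from a set of haplotype frequencies until every haplotype has been seen, and compares the average effort under equal and skewed frequencies. The function names, the replicate count and both frequency vectors are hypothetical. For equal frequencies the expected effort is the classic coupon-collector value, H × (1 + 1/2 + ... + 1/H), which is about 11.4 specimens for H = 5.

```python
import random


def specimens_until_all_found(frequencies, rng):
    """Draw specimens one at a time until every haplotype has been seen."""
    if any(f <= 0 for f in frequencies):
        raise ValueError("every haplotype needs a positive frequency")
    haplotypes = list(range(len(frequencies)))
    seen = set()
    draws = 0
    while len(seen) < len(haplotypes):
        seen.add(rng.choices(haplotypes, weights=frequencies)[0])
        draws += 1
    return draws


def mean_specimens_needed(frequencies, replicates=2000, seed=1):
    """Average sampling effort over many simulated surveys."""
    rng = random.Random(seed)
    total = sum(specimens_until_all_found(frequencies, rng) for _ in range(replicates))
    return total / replicates


# Hypothetical frequency vectors for a species with five haplotypes.
equal = [0.2, 0.2, 0.2, 0.2, 0.2]
skewed = [0.80, 0.10, 0.05, 0.03, 0.02]

print("equal: ", mean_specimens_needed(equal))
print("skewed:", mean_specimens_needed(skewed))
```

The skewed case needs far more specimens on average, because the rarest haplotype (2% here) dominates the waiting time. This is the sense in which an equal-frequency model is expected to underestimate N*.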

1 comment:

  1. That looks interesting...but perhaps a more widespread use of rarefaction curves (or "accumulation curves") might be a good solution? I understand that one needs to have some starting point to design sampling, but perhaps rarefaction curves could help deciding if more sampling is necessary...
    I guess it would also partially boil down to the final aim of a study...