A new publication in PLoS ONE proposes a direct way of assessing sequencing errors in published records.
The authors tested the hypothesis that sequencing errors in reference barcodes can be detected as very low frequency variants at sequence positions that are otherwise highly conserved. They used their approach to assess sequencing error in a large dataset of bird sequences (11,333 sequences from 2706 species) which they obtained from GenBank. They included both sequences with the keyword "BARCODE" and those without. The keyword indicates that the DNA barcodes follow the so-called barcode standard: a minimum of 500 bp from a defined region (COI in animals), linkage to museum specimens, and publicly archived trace files documenting a minimum quality score.
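To make the idea concrete, here is a minimal sketch of how rare variants at otherwise conserved alignment columns could be flagged. It is not the authors' workflow (they worked largely in Excel). The function name `find_vlfs`, the `max_freq` threshold, and the choice to pool all sequences into one alignment are assumptions made for illustration only.

```python
from collections import Counter

def find_vlfs(aligned, max_freq=0.001):
    """Flag very low frequency variants in equal-length aligned sequences.

    Returns a list of (sequence_index, position, base) tuples for every base
    whose frequency in its alignment column is below max_freq.
    max_freq is an illustrative threshold, not a value taken from the paper.
    """
    hits = []
    n_cols = len(aligned[0])
    for pos in range(n_cols):
        column = [seq[pos] for seq in aligned]
        # Count only unambiguous bases; gaps and N are ignored.
        counts = Counter(b for b in column if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            continue
        rare = {b for b, c in counts.items() if c / total < max_freq}
        for i, b in enumerate(column):
            if b in rare:
                hits.append((i, pos, b))
    return hits

# Toy alignment: nine identical sequences and one with a single odd base.
toy = ["ACGT"] * 9 + ["ACGA"]
vlfs = find_vlfs(toy, max_freq=0.2)
```

With real data the threshold would be far lower than in the toy call, and the input would be a proper COI alignment rather than four-base strings.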
I think this study shows three very interesting results:
Figure: Prevalence of VLFs over time (Stoeckle & Kerr 2012)
(1) The very low frequency variants (VLFs) detected in single individuals of a species were mostly concentrated at the ends of the barcode sequence, consistent with sequencing error. The maximum error rate is estimated at approximately 0.05 errors per barcode sequence.
(2) The method was able to recognize a number of overlooked cryptic pseudogenes that lack stop codons, which are usually the best indicator of a pseudogene.
(3) The overall quality of the dataset was high, especially in comparison with COI sequences deposited before implementation of the barcode standard, which supports the standard's effectiveness.
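The first result combines two simple summaries: how many VLFs there are per sequence, and how many of them sit near the sequence ends. A rough sketch of both follows, assuming VLF calls are already available as (sequence index, position, base) tuples. The function name `summarize_vlfs` and the `end_width` of 50 positions are hypothetical choices, not values from the paper. Since some VLFs are genuine biological variants, the per-sequence figure is best read as an upper bound on the error rate.

```python
def summarize_vlfs(hits, n_seqs, n_cols, end_width=50):
    """Summarize VLF calls given as (sequence_index, position, base) tuples.

    end_width is an assumed definition of "the ends" of the barcode:
    the first and last end_width alignment positions.
    """
    at_ends = sum(
        1 for _, pos, _ in hits
        if pos < end_width or pos >= n_cols - end_width
    )
    return {
        # Upper bound on errors per sequence: every VLF is counted as an error.
        "vlfs_per_sequence": len(hits) / n_seqs,
        "fraction_at_ends": at_ends / len(hits) if hits else 0.0,
    }

# Made-up example: four VLF calls among 100 sequences of 648 aligned positions.
example_hits = [(0, 2, "A"), (1, 300, "C"), (2, 640, "G"), (3, 645, "T")]
summary = summarize_vlfs(example_hits, n_seqs=100, n_cols=648)
```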
A very nice paper, with most of the analysis done in Excel. I can't wait to have a look at some of my fish datasets.