Thursday, November 20, 2014

Type material on GenBank

I consider it very good news - GenBank now includes annotation of type material:

Type material is the taxonomic device that ties formal names to the physical specimens that serve as exemplars for the species. For the prokaryotes these are strains submitted to the culture collections; for the eukaryotes they are specimens submitted to museums or herbaria. The NCBI Taxonomy Database now includes annotation of type material that we use to flag sequences from type in GenBank and in Genomes.

In a recently published paper, Scott Federhen from GenBank provides all the necessary details on this new feature. GenBank has indexed the Nucleotide domain of its Entrez database with ‘sequence from type [filter]’, which allows users to retrieve sequence entries that are derived from type material. At the time of publication, the paper listed 72 750 type sequences from different genes, representing 18 847 different species. These are available in the taxonomy dump files on the NCBI FTP site, and are searchable in Taxonomy Entrez and in the taxonomy browser.

From my research for my weekly column on new species descriptions, I know how few of them actually make use of genetic data, let alone DNA Barcodes. I am sure there is a plethora of reasons for this. It would certainly be too easy, and almost an excuse, to always blame it on the lack of funds. Very often it might also be the lack of any incentive. If I can publish a regular, fully valid species description using morphology alone, why would I want to add, for example, DNA Barcoding to it? It costs more, it is extra work, and even worse, some of my colleagues would actually criticize me for doing it. So why do some taxonomists do it anyway? The answer is rather simple - once they have used it, they start seeing the value in it. First and foremost, it makes their work easier and allows them to focus on discovery. Increasingly often, molecular genetics leads to the discovery of hitherto unrecognized species. A secondary effect is what it means for the scientific community at large.

One key element of the Barcode standard, which GenBank adopted back in 2005, is an unambiguous link to the voucher specimen from which the Barcode was derived. If this voucher happens to be a type specimen, the associated DNA Barcode is tightly linked to the original species description. Provided that all other requirements of the standard are met, this becomes an ideal DNA Barcode. I did a very quick search on GenBank using the new annotation filter for types together with the reserved keyword BARCODE and found 158 sequences for 90 species. Not a lot, but a good start, especially given that this is a new feature. Not all eligible sequences have been assigned the type designation yet. I just looked at one extreme example, the Butcher et al. paper from 2012, in which 179 new species of parasitic wasps were described largely using DNA Barcodes. The sequences are on GenBank and they have the magic BARCODE keyword, but the connection to the types hasn't been made yet.
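For anyone who wants to repeat this kind of search programmatically, here is a minimal sketch of how the query could be sent to NCBI's E-utilities `esearch` endpoint. The filter text comes from the paper; pairing it with `BARCODE[Keyword]` is my assumption about how to express the reserved keyword in Entrez syntax, and the function names are just illustrative. Counts returned today will of course differ from the numbers above.

```python
import json
import urllib.parse
import urllib.request

# Public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(barcode_only=True):
    """Build an Entrez query for sequences flagged as derived from type material.

    The BARCODE[Keyword] clause is an assumption about how the reserved
    keyword is best expressed; adjust it if your own search differs.
    """
    parts = ["sequence from type[filter]"]
    if barcode_only:
        parts.append("BARCODE[Keyword]")
    return " AND ".join(parts)


def build_esearch_url(term, db="nuccore"):
    """Return the esearch URL for the term; retmax=0 asks for the count only."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": 0}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def count_records(term):
    """Query NCBI (needs network access) and return the number of hits."""
    with urllib.request.urlopen(build_esearch_url(term)) as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])


if __name__ == "__main__":
    # Print the URL only; call count_records(build_term()) to run the search.
    print(build_esearch_url(build_term()))
```

Calling `count_records(build_term(barcode_only=False))` should give the size of the whole type-derived subset, which makes it easy to track how the annotation grows over time.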

It is probably time to look at all our GenBank submissions and check whether they need an update. Some of them could be from a type and deserve the extra recognition. Lots of work to do, but certainly worth the effort, as Scott also states in his conclusions:

Sequence from type is a high-value subset of GenBank for which we can maintain a very high level of confidence in the taxonomic identification. Nomenclatural acts involving type material are carefully documented in the taxonomic literature, so we can reasonably hope to keep these identifications current...Diligent curation of sequences from type material in GenBank as outlined above can make this set even more reliable. Species with problematic taxonomy are still problematic, but egregious misidentifications can be found and corrected. This does not solve the more general problem of misidentified entries in GenBank, but does provide a reliable backbone of correctly identified entries that could help support a more general solution.

