Wednesday, March 11, 2015

DNA Barcode Conference Plenary - Rod Page

We continue in the series of guest posts by plenary speakers of the 6th International Barcode of Life Conference. Today we have Rod Page introducing some of his work which is also very relevant to DNA barcoding.

Rod is perhaps best known as the author of the phylogenetic visualisation program “TreeView” and, more recently, for his “iPhylo” blog. He started out as a crustacean taxonomist, before being swept up in the debates on panbiogeography and vicariance biogeography that raged in the 1980s and 90s. After gaining a PhD at Auckland University, New Zealand, he worked as a postdoc at the American Museum of Natural History in New York and The Natural History Museum in London, before taking up a lectureship at the University of Oxford. Since 1995 Rod has been at the University of Glasgow, where he is Professor of Taxonomy. A past editor of Systematic Biology, he is currently Chair of the GBIF Science Committee. His current work focuses on linking together biodiversity data from diverse sources:

One of the things I'm most interested in is the challenge of linking together diverse sources of biodiversity data into a "biodiversity knowledge graph". I've explored aspects of this problem on my blog iPhylo, and here I've picked out some iPhylo posts that are most relevant to DNA barcoding.

"Dark taxa" are taxa on GenBank that lack full scientific names, and as I showed in 2011 (Dark taxa: GenBank in a post-taxonomic world) they are becoming more common in GenBank, thanks in no small part to the increase in DNA barcoding sequences being deposited in that database. Dark taxa make it problematic to link biodiversity data using taxonomic names alone, so much so that GenBank has ended up suppressing some barcodes (see Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank). 

The dark taxa problem illustrates the difficulties in integrating data from different sources, particularly if the sources have different ways of handling the "same" data. Another example is the recent addition of sequence data to the Global Biodiversity Information Facility (GBIF). There is enormous scope for cross-linking the BOLD, GenBank, and GBIF databases, but there are some challenges to resolve when doing this. Although some 4 million geotagged DNA sequences have been added to GBIF (many of which are DNA barcodes), this does not represent 4 million unique observations of where organisms are distributed. Many sequences in GenBank come from the same specimen. For example, a museum voucher may be sequenced for a DNA barcode, as well as for other mitochondrial and nuclear genes. In extreme cases, shotgun sequencing projects can generate hundreds of thousands of sequences from the same specimen. Indeed, GBIF now has 551,919 records for the dwarf birch Betula nana which correspond to shotgun sequences from a single specimen of that plant collected in Scotland! A major challenge will be aggregating sequences from the same samples into single occurrence records that can be integrated into GBIF.
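To make the aggregation problem concrete, here is a minimal Python sketch of collapsing sequence records that share a specimen into a single occurrence. The record fields (`accession`, `voucher`, `taxon`, `lat`, `lon`), the accession numbers, the voucher code and the coordinates are all made up for illustration; real GenBank or GBIF records are messier, and deciding when two voucher strings refer to the same specimen is the hard part that this sketch simply assumes away.

```python
from collections import defaultdict


def aggregate_by_specimen(records):
    """Collapse sequence records that share a specimen into one occurrence each."""
    groups = defaultdict(list)
    for rec in records:
        # Assumption: an identical voucher string means the same specimen.
        # Records without a voucher are kept as their own occurrence.
        key = rec.get("voucher") or "unvouchered:" + rec["accession"]
        groups[key].append(rec)

    occurrences = []
    for key, recs in groups.items():
        first = recs[0]
        occurrences.append({
            "specimen": key,
            "taxon": first["taxon"],
            # Assumption: all sequences from one specimen share one locality.
            "lat": first["lat"],
            "lon": first["lon"],
            "accessions": sorted(r["accession"] for r in recs),
        })
    return occurrences


# Made-up example: three genes sequenced from one voucher, plus one
# sequence with no voucher information at all.
records = [
    {"accession": "FAKE0001", "voucher": "MUS 12345", "taxon": "Betula nana", "lat": 57.0, "lon": -4.0},
    {"accession": "FAKE0002", "voucher": "MUS 12345", "taxon": "Betula nana", "lat": 57.0, "lon": -4.0},
    {"accession": "FAKE0003", "voucher": "MUS 12345", "taxon": "Betula nana", "lat": 57.0, "lon": -4.0},
    {"accession": "FAKE0004", "voucher": None, "taxon": "Betula nana", "lat": 56.5, "lon": -4.5},
]

occurrences = aggregate_by_specimen(records)
for occ in occurrences:
    print(occ["specimen"], len(occ["accessions"]))
```

Four sequence records become two occurrences here, and each occurrence keeps the list of accessions it was built from, so the link back to the individual sequences is not lost.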

In Displaying a million DNA barcodes on Google Maps using CouchDB I describe a method for making an interactive map of DNA barcodes using Google Maps and the BOLD API. The result is live, and the dots on the map are clickable, so you can get more details on the barcodes found at any point on the planet. In response to user feedback I've added the ability to filter the barcodes by taxonomic group.

I've long been interested in the visualisation of biodiversity data, and recently have started playing with putting molecular phylogenies onto web-based maps by converting trees into a data format called "GeoJSON", which is widely used by web-based mapping tools (see GeoJSON and geophylogenies). Because most DNA barcodes come with known latitudes and longitudes, they are a great source of data for this method. For example, below is a "geophylogeny" for barcodes from Proechimys guyannensis and its relatives. Sequences from the same BIN have the same colour, which helps make visible the phylogeographic structure in the data.
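As a rough illustration of the idea, the Python sketch below turns a small tree into a GeoJSON FeatureCollection. It is not the method from the iPhylo post: it uses one deliberately simple layout, in which each tip is a Point at its collecting locality and each internal node sits at the average position of its children, with a LineString for every branch. The nested-dict tree format, the tip names, the BIN labels and the coordinates are invented for the example. The one firm rule taken from the GeoJSON format itself is that positions are written as [longitude, latitude].

```python
import json


def tree_to_geojson(tree):
    """Convert a nested-dict tree with georeferenced tips into GeoJSON."""
    features = []

    def visit(node):
        # Tip: emit a Point at the collecting locality.
        if "children" not in node:
            pos = [node["lon"], node["lat"]]
            features.append({
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": pos},
                "properties": {"name": node["name"], "bin": node.get("bin")},
            })
            return pos

        # Internal node: place it at the mean position of its children.
        # (Naive averaging; it would misbehave across the antimeridian.)
        child_pos = [visit(child) for child in node["children"]]
        pos = [
            sum(p[0] for p in child_pos) / len(child_pos),
            sum(p[1] for p in child_pos) / len(child_pos),
        ]
        # One LineString per branch, from this node to each child.
        for cp in child_pos:
            features.append({
                "type": "Feature",
                "geometry": {"type": "LineString", "coordinates": [pos, cp]},
                "properties": {"kind": "branch"},
            })
        return pos

    visit(tree)
    return {"type": "FeatureCollection", "features": features}


# Made-up tree ((A,B),C) with invented localities and BIN labels.
tree = {
    "children": [
        {"children": [
            {"name": "A", "bin": "BIN-1", "lon": -60.0, "lat": 2.0},
            {"name": "B", "bin": "BIN-1", "lon": -58.0, "lat": 4.0},
        ]},
        {"name": "C", "bin": "BIN-2", "lon": -52.0, "lat": 0.0},
    ]
}

geojson = tree_to_geojson(tree)
print(json.dumps(geojson, indent=2))
```

The printed JSON can be loaded by web mapping tools that read GeoJSON. The `bin` property on each tip is there so that a map style could colour tips by BIN, as described above.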
