One of the main activities of the DNA barcoding community is to generate an open-access library of reference barcode sequences, which enables anyone able to obtain sequence information to identify specimens. Barcodes of unidentified specimens can be compared with reference barcodes to find the matching species.
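To make the comparison step concrete, here is a minimal toy sketch in Python. It assumes the sequences are already aligned to the same length and uses the simple proportion of differing sites (p-distance). The function names, the made-up sequences, and the 2% cut-off are illustrative assumptions only, and this is not how BOLD's own identification engine works.

```python
from typing import Dict, Optional, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def best_match(
    query: str,
    references: Dict[str, str],
    max_distance: float = 0.02,  # illustrative cut-off, an assumption
) -> Optional[Tuple[str, float]]:
    """Return (species, distance) for the closest reference barcode,
    or None if no reference lies within max_distance."""
    best: Optional[Tuple[str, float]] = None
    for species, sequence in references.items():
        d = p_distance(query, sequence)
        if best is None or d < best[1]:
            best = (species, d)
    if best is not None and best[1] <= max_distance:
        return best
    return None


# Made-up 20-base toy sequences; real barcodes are far longer.
REFERENCES = {
    "Species A": "ACGTACGTACGTACGTACGT",
    "Species B": "ACGTTCGTACCTACGAACGT",
}

if __name__ == "__main__":
    query = "ACGTACGTACGTACGTACGT"
    print(best_match(query, REFERENCES))
```

A query that is too distant from every reference returns `None`, which mirrors the situation where the library simply does not yet contain the species.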
Over more than ten years we have generated almost 4 million barcodes for over 400,000 species. Many of these species were selected because they are of special interest to particular users who need the ability to identify "their" species. As a result of this progress, however, we are increasingly looking at completing larger reference libraries for particular taxonomic groups, either worldwide or more limited in range, such as national libraries. Building such barcode reference libraries in a systematic fashion can be challenging and comes with its own set of potential issues.
A new study published in PLOS ONE last week uses the work on the DNA barcode library for the Hemiptera of Canada as an example to look at strategies for constructing reference libraries:
We discuss the development of our workflow in the context of prior DNA barcode library construction projects, emphasizing the importance of delineating a set of reference specimens to aid investigations in cases of nomenclatural and DNA barcode discordance. The identification for each specimen in the reference set can be annotated on the Barcode of Life Data System (BOLD), allowing experts to highlight questionable identifications; annotations can be added by any registered user of BOLD, and instructions for this are provided.
The authors define reference records as publicly available DNA barcode sequences linked to voucher specimens via BOLD that:
- meet DNA barcode data standards
- are identified to species listed in a taxonomic catalogue
- are specified using a Digital Object Identifier
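The criteria above can be read as a simple filter over records. The following Python sketch is only an illustration: the `BarcodeRecord` class, its field names, the placeholder species names, and the placeholder DOI are all hypothetical and do not reflect BOLD's actual data model or the authors' workflow.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class BarcodeRecord:
    """Hypothetical record; the field names are assumptions for illustration."""
    record_id: str
    is_public: bool
    meets_barcode_standard: bool
    species_name: Optional[str]  # None if not identified to species
    dataset_doi: Optional[str]   # None if the record is not specified by a DOI


def is_reference_record(record: BarcodeRecord, catalogue: Set[str]) -> bool:
    """True if the record satisfies all of the listed criteria."""
    return (
        record.is_public
        and record.meets_barcode_standard
        and record.species_name in catalogue
        and bool(record.dataset_doi)
    )


def select_reference_records(
    records: List[BarcodeRecord], catalogue: Set[str]
) -> List[BarcodeRecord]:
    """Keep only the records that qualify as reference records."""
    return [r for r in records if is_reference_record(r, catalogue)]


# Placeholder catalogue and records (not real taxa, not a real DOI).
CATALOGUE = {"Genus alpha", "Genus beta"}
RECORDS = [
    BarcodeRecord("R1", True, True, "Genus alpha", "10.0000/placeholder"),
    BarcodeRecord("R2", True, True, None, "10.0000/placeholder"),
    BarcodeRecord("R3", True, False, "Genus beta", "10.0000/placeholder"),
    BarcodeRecord("R4", True, True, "Genus gamma", "10.0000/placeholder"),
    BarcodeRecord("R5", True, True, "Genus beta", None),
]

if __name__ == "__main__":
    for record in select_reference_records(RECORDS, CATALOGUE):
        print(record.record_id)
```

Records that fail any one test (no species-level name, a name missing from the catalogue, a substandard sequence, or no DOI) stay out of the reference set but remain available as ordinary records.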
The resulting library of Canadian Hemiptera covers about 45% of the recognized species (~1850), based on samples of 20,851 specimens belonging to 628 genera and 64 families. The total sampling yielded records for 54,280 specimens, which included 2,671 species. These include specimens collected in Canada that fall into BINs (Barcode Index Numbers) without any named specimens. As such, they represent possible new species records on BOLD, but they have not been included in the library dataset pending further detail.
As so often, the last words of my post belong to the authors of the study:
Lastly, we recognize the multimodal nature of biodiversity inference can be unifying. DNA sequences without names, specimens without sequences, names that are synonymies—can all be reciprocally illuminating in a comparative framework. Conventionally, these frameworks have been important, but isolated works: catalogues, monographs, revisions, descriptions, sequences, and their epistemological foundation—specimens. As demonstrated here, an infrastructure that digitally aggregates specimens in a DNA-mediated reference library can unite these isolated resources into open-sourced ‘virtual unit trays’ of specimens, and sequences, from across the world’s collections. Such an effort integrates isolated resources to create a shared understanding about taxon diversity.
Communicating about how we evolve such an infrastructure is a frontier in biodiversity science. And we share the ideas in this paper to help evolve inference methods that are at once public, repeatable, collaborative, and comparative.