Friday, June 22, 2018

Weekend reads

Here we go again: another week has passed quickly. Light on posting, mainly because I had some days off and no chance to do any digging for blog posts. Nevertheless, here is your weekly dose of interesting papers. Really good stuff.

Genetic taxonomic assignment can be more sensitive than morphological taxonomic assignment, particularly for small, cryptic or rare species. Sequence processing is essential to taxonomic assignment, but can also produce errors because optimal parameters are not known a priori. Here, we explored how sequence processing parameters influence taxonomic assignment of 18S sequences from bulk zooplankton samples produced by 454 pyrosequencing. We optimized a sequence processing pipeline for two common research goals, estimation of species richness and early detection of aquatic invasive species (AIS), and then tested most optimal models' performances through simulations. We tested 1,050 parameter sets on 18S sequences from 20 AIS to determine optimal parameters for each research goal. We tested optimized pipelines' performances (detectability and sensitivity) by computationally inoculating sequences of 20 AIS into ten bulk zooplankton samples from ports across Canada. We found that optimal parameter selection generally depends on the research goal. However, regardless of research goal, we found that metazoan 18S sequences produced by 454 pyrosequencing should be trimmed to 375-400 bp and sequence quality filtering should be relaxed (1.5 ≤ maximum expected error ≤ 3.0, Phred score = 10). Clustering and denoising were only viable for estimating species richness, because these processing steps made some species undetectable at low sequence abundances which would not be useful for early detection of AIS. With parameter sets optimized for early detection of AIS, 90% of AIS were detected with fewer than 11 target sequences, regardless of whether clustering or denoising was used. Despite developments in next-generation sequencing, sequence processing remains an important issue owing to difficulties in balancing false-positive and false-negative errors in metabarcoding data.
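To make the "maximum expected error" idea from this abstract a bit more tangible, here is a minimal sketch of trimming a read and filtering it by expected errors. It assumes the usual Phred relationship (error probability = 10^(-Q/10)); the function names, the default values and the choice to discard reads shorter than the trim length are my own illustrative assumptions, not the authors' pipeline.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, using P = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def trim_and_filter(seq, phred_scores, trim_len=400, max_ee=3.0):
    """Trim a read to trim_len bases and keep it only if the expected
    number of errors in the trimmed read is at most max_ee.

    Returns the trimmed sequence, or None if the read is rejected.
    """
    # Assumption for this sketch: reads shorter than trim_len are discarded.
    if len(seq) < trim_len:
        return None
    trimmed_seq = seq[:trim_len]
    trimmed_quals = phred_scores[:trim_len]
    if expected_errors(trimmed_quals) <= max_ee:
        return trimmed_seq
    return None
```

A read of 400 bases at Q30 carries about 0.4 expected errors and passes a threshold of 3.0, while the same read at Q20 carries about 4 and is dropped. Relaxing `max_ee` keeps more low-abundance reads at the cost of more noise, which is the trade-off the abstract describes.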

DNA metabarcoding has been introduced as a revolutionary way to identify organisms and monitor ecosystems. However, the potential of this approach for biomonitoring remains partially unfulfilled because a significant part of the sampled DNA cannot be affiliated to species due to incomplete reference libraries. Thus, biotic indices which are based on the estimated abundances of species in a community and their ecological profiles can be inaccurate. We propose to compute biotic indices using phylogenetic imputation of OTUs' ecological profiles (OTU-PITI approach). Firstly, OTUs sequences are inserted within a reference phylogeny. Secondly, OTUs' ecological profiles are estimated on the basis of their phylogenetic relationships with reference species whose ecology is known. Based on these ecological profiles, biotic indices can be computed using all available OTUs. Using freshwater diatoms as a case study, we show that short DNA barcodes can be placed accurately within a phylogeny and their ecological preferences estimated with a satisfactory level of precision. In light of these results, we tested the approach with a dataset of 139 environmental samples of benthic river diatoms for which the same biotic index (IPS) was calculated using (i) traditional microscopy, (ii) OTUs with taxonomic assignment approach, (iii) OTUs with phylogenetic estimation of ecological profiles (OTU-PITI), and (iv) OTU with taxonomic assignment completed by the phylogenetic approach (OTU-PITI) for unclassified OTUs. Using traditional microscopy as a reference, we found that the combination of the OTUs' taxonomic assignment completed by the phylogenetic method performed satisfactorily and substantially better than the other methods tested.

BACKGROUND: High throughput DNA sequencing of bulk invertebrate samples or metabarcoding is becoming increasingly used to provide profiles of biological communities for environmental monitoring. As metabarcoding becomes more widely applied, new reference DNA barcodes linked to individual specimens identified by taxonomists are needed. This can be achieved through using DNA extraction methods that are not only suitable for metabarcoding but also for building reference DNA barcode libraries.
METHODS: In this study, we test the suitability of a rapid non-destructive DNA extraction method for metabarcoding of freshwater invertebrate samples.
RESULTS: This method resulted in detection of taxa from many taxonomic groups, comparable to results obtained with two other tissue-based extraction methods. Most taxa could also be successfully used for subsequent individual-based DNA barcoding and taxonomic identification. The method was successfully applied to field-collected invertebrate samples stored for taxonomic studies in 70% ethanol at room temperature, a commonly used storage method for freshwater samples.
DISCUSSION: With further refinement and testing, non-destructive extraction has the potential to rapidly characterise species biodiversity in invertebrate samples, while preserving specimens for taxonomic investigation.

Marine plankton populate 70% of Earth's surface, providing the energy that fuels ocean food webs and contributing to global biogeochemical cycles. Plankton communities are extremely diverse and geographically variable, and are overwhelmingly composed of low-abundance species. The role of this rare biosphere and its ecological underpinnings are however still unclear. Here, we analyse the extensive dataset generated by the Tara Oceans expedition for marine microbial eukaryotes (protists) and use an adaptive algorithm to explore how metabarcoding-based abundance distributions vary across plankton communities in the global ocean. We show that the decay in abundance of non-dominant operational taxonomic units, which comprise over 99% of local richness, is commonly governed by a power-law. Despite the high spatial turnover in species composition, the power-law exponent varies by less than 10% across locations and shows no biogeographical signature, but is weakly modulated by cell size. Such striking regularity suggests that the assembly of plankton communities in the dynamic and highly variable ocean environment is governed by large-scale ubiquitous processes. Understanding their origin and impact on plankton ecology will be important for evaluating the resilience of marine biodiversity in a changing ocean.
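For readers wondering what "governed by a power-law" means in practice, here is a rough sketch that estimates a power-law exponent from a ranked list of OTU abundances with an ordinary least-squares fit on log-log axes. This is a generic textbook approach on a rank-abundance curve and explicitly not the adaptive algorithm used in the paper; the function name and the synthetic data in the usage note are made up for illustration.

```python
import math


def power_law_exponent(abundances):
    """Estimate a in abundance ~ rank ** (-a) by least squares
    on log(rank) versus log(abundance)."""
    ranked = sorted((a for a in abundances if a > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive abundances")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(a) for a in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The slope of the log-log line is -a, so flip the sign.
    return -covariance / variance
```

Feeding it synthetic abundances such as `[1000 * r ** -1.5 for r in range(1, 51)]` recovers an exponent close to 1.5. Real read counts are noisier, and a simple regression like this is sensitive to the dominant OTUs and to the sparse tail.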

MOTIVATION: Correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. However, no genetic marker gives sufficient performance across all the biological kingdoms, hampering studies of taxonomic diversity in many groups of organisms. This has led to the adoption of a range of genetic markers for DNA metabarcoding. While many taxonomic classification software tools can be re-trained on these genetic markers, they are often designed with assumptions that impair their utility on genes other than the SSU and LSU rRNA. Here, we present an update to Metaxa2 that enables the use of any genetic marker for taxonomic classification of metagenome and amplicon sequence data.
RESULTS: We evaluated the Metaxa2 Database Builder on eleven commonly used barcoding regions and found that while there are wide differences in performance between different genetic markers, our software performs satisfactorily provided that the input taxonomy and sequence data are of high quality.
AVAILABILITY: Freely available on the web as part of the Metaxa2 package at

BACKGROUND: The world's herbaria contain millions of specimens, collected and named by thousands of researchers, over hundreds of years. However, this treasure has remained largely inaccessible to genetic studies, because of both generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates.
RESULTS: As a practical test of routine recovery of rDNA and plastid genome sequences from herbarium specimens, we sequenced 25 herbarium specimens up to 80 years old from 16 different Angiosperm families. Paired-end reads were generated, yielding successful plastid genome assemblies for 23 species and nuclear rDNAs for 24 species, respectively. These data showed that genome skimming can be used to generate genomic information from herbarium specimens as old as 80 years and using as little as 500 pg of degraded starting DNA.
CONCLUSIONS: The routine plastome sequencing from herbarium specimens is feasible and cost-effective (compare with Sanger sequencing or plastome-enrichment approaches), and can be performed with limited sample destruction.

The increasing popularity of cytochrome c oxidase subunit 1 (COI) DNA metabarcoding warrants a careful look at the underlying reference databases used to make high-throughput taxonomic assignments. The objectives of this study are to document trends and assess the future usability of COI records for metabarcode identification. Over 2.5 million COI sequences were found in GenBank, half of which were fully identified to the species rank. From 2003 to 2017, the number of COI Eukaryote records deposited has grown by two orders of magnitude representing a nearly 42-fold increase in unique species. For fully identified records, 92% are at least 500 bp in length, 74% have a country annotation, and 51% have latitude-longitude annotations. To ensure the future usability of COI records in GenBank we suggest: 1) Improving the geographic representation of COI records 2) Improving the cross-referencing of COI records in the Barcode of Life Data System and GenBank to facilitate consolidation and incorporation into existing bioinformatic pipelines, 3) Adherence to the minimum information about a marker gene sequence guidelines, and 4) Integrating metabarcodes from eDNA and mixed community studies with existing sequences. COI metabarcoders are normally considered consumers of taxonomic data. Here we discuss the potential for taxonomists to reverse this pattern and instead mine metabarcode data to guide species discovery. The growth of COI reference records over the past 15 years has been substantial and is likely to be a resource across many fields for years to come.

Thursday, June 14, 2018


Ever seen anything in relation to the hashtag #BadStockPhotosOfMyJob? If not, you should check out Twitter or search for it on Google, because it shows some ridiculously funny photos that exhibit some of the worst stereotypes people have when thinking about others' jobs. The perception of what scientists do is especially close to tragic. I thought it would be a good idea to show a few examples, including ironic comments by the real scientists. It's funny indeed, but sometimes also just sad to see what others think we scientists do for a living.


  I have no words for those four.

Wednesday, June 13, 2018

Interview with a vampire

In this study, we show for the first time that it is possible to use DNA meta-barcoding to generate data on both diet and the predator's population structure. And we more or less get this additional information for free because the vampire bat's DNA is found in the DNA that we extract from blood meal and faecal samples.

When the sun sets in South and Central America, the vampire bats wake up and fly out in search of prey. The vampire bat's diet consists of blood. It prefers to feed on domestic animals such as cows and pigs, but when it does so, there is a risk of transmitting pathogens such as rabies. In order to control rabies transmitted by vampire bats, it is crucial to have a method that allows large-scale assessment of vampire bat prey. A study published back in April, led by researchers from Denmark and the UK, shows that metabarcoding can do just that.

The colleagues analysed vampire bat blood meal and faecal samples collected in Peru, along the coast, in the Andes and in the Amazon. In diet studies, metabarcoding is normally used only to assess diet, but in this study the researchers went one step further and gathered information on the vampire bat's population structure. The latter is an approach very similar to work my group has been doing in collaboration with researchers in Germany. This 'free of charge' data can help researchers understand how the landscape influences the connectivity of vampire bat populations, which could influence the spread of pathogens.

We are slowly beginning to understand that all the metabarcoding data we generate to better understand the community composition of a given environment contain several layers of information. They are perhaps much richer than an OTU table. That being said, how to extract, let alone disentangle, all that information is an entirely different story.

It is great to gain insight into both predator and prey from DNA in droppings and blood meals. Apart from feeding on domestic animals, vampire bats occasionally took blood from wild tapirs, so the method may be useful for determining the distribution of elusive mammal prey. It is also of note that we found no evidence of vampire bats feeding on humans from the DNA left over from their dinners.

Tuesday, June 12, 2018

Citizen science vs giant slugs

Citizen science is a powerful tool to combat the challenges created by invasive species. Our study emphasizes the importance of collaborations between researchers, government administration, and citizen volunteers. 

The giant slug Limax maximus is an invasive species which made its way from northern Europe all the way to Japan and other regions of the world. It is a notorious pest of horticultural and agricultural crops. 

Recently, a Japanese research team found that a certain set of weather conditions could be a reliable short-term indicator of how often giant slugs would appear on a set mountain path. The findings showed that the slugs were more likely to appear on days with higher humidity, lower wind speed and lower precipitation than the 20-year average. These observations can be used to predict future outbreaks of the pest.

This study was actually made possible by citizen science. In order to survey the number of slugs present on the mountain path chosen for the study (Mt. Maruyama route, in Sapporo, Japan), a volunteer naturalist hiked the path at 5:00 AM nearly every day for two years. The colleagues collected weather data obtained from a nearby meteorological station and combined them with observational data to calculate correlations between slug appearances and complex weather conditions.

Friday, June 8, 2018

Weekend readings

Need some readings for a sunny weekend? Not enough papers on the pile on your desk? Here is a solution for you. A couple of interesting journal articles I came across this week. Enjoy.

The genus Amara Bonelli, 1810 is a very speciose and taxonomically difficult genus of the Carabidae. The identification of many of the species is accomplished with considerable difficulty, in particular for females and immature stages. In this study the effectiveness of DNA barcoding, the most popular method for molecular species identification, was examined to discriminate various species of this genus from Central Europe. DNA barcodes from 690 individuals and 47 species were analysed, including sequences from previous studies and more than 350 newly generated DNA barcodes. Our analysis revealed unique BINs for 38 species (81%). Interspecific K2P distances below 2.2% were found for three species pairs and one species trio, including haplotype sharing between Amara alpina/Amara torrida and Amara communis/Amara convexior/Amara makolskii. This study represents another step in generating an extensive reference library of DNA barcodes for carabids, highly valuable bioindicators for characterizing disturbances in various habitats.

The correct identification of species in the highly divergent group of plants is crucial for several forensic investigations. Previous works had difficulties in the establishment of a rapid and robust method for the identification of plants. For instance, DNA barcoding requires the analysis of two or three different genomic regions to attain reasonable levels of discrimination. Therefore, new methods for the molecular identification of plants are clearly needed. Here we tested the utility of variable-length sequences in the chloroplast DNA (cpDNA) as a way to identify plant species. The SPInDel (Species Identification by Insertions/Deletions) approach targets hypervariable genomic regions that contain multiple insertions/deletions (indels) and length variability, which are found interspersed with highly conserved regions. The combination of fragment lengths defines a unique numeric profile for each species, allowing its identification. We analysed more than 44,000 sequences retrieved from public databases belonging to 206 different plant families. Four target regions were identified as suitable for the SPInDel concept: atpF-atpH, psbA-trnH, trnL CD and trnL GH. When considered alone, the discrimination power of each region was low, varying from 5.18% (trnL GH) to 42.54% (trnL CD). However, the discrimination power reached more than 90% when the length of some of these regions is combined. We also observed low diversity in intraspecific data sets for all target regions, suggesting they can be used for identification purposes. Our results demonstrate the utility of the SPInDel concept for the identification of plants.
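The core idea, that a combination of fragment lengths acts as a numeric profile, is easy to play with. Below is a minimal sketch that treats a species as "discriminated" when no other species shares its length profile. The species names and fragment lengths are invented, and this definition of discrimination power is my reading of the abstract rather than the paper's exact formula.

```python
from collections import Counter


def discrimination_power(profiles):
    """Fraction of species whose length profile is shared with no other species.

    profiles maps a species name to a tuple of fragment lengths (bp).
    """
    counts = Counter(profiles.values())
    unique = sum(1 for profile in profiles.values() if counts[profile] == 1)
    return unique / len(profiles)


# Hypothetical fragment lengths (bp) for two target regions per species.
lengths = {
    "species_a": (310, 455),
    "species_b": (310, 470),
    "species_c": (325, 455),
    "species_d": (325, 455),
}

# Profiles built from the first region only, and from both regions combined.
first_region_only = {sp: (regions[0],) for sp, regions in lengths.items()}
both_regions = dict(lengths)
```

With these toy numbers, the first region alone separates none of the four species, while the combined profile separates two of them. That mirrors the pattern in the abstract, where single regions perform poorly and combinations do much better.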

Environmental DNA (eDNA) metabarcoding has been increasingly applied to biodiversity surveys in stream ecosystems. In stream networks, the accuracy of eDNA-based biodiversity assessment depends on whether the upstream eDNA influx affects downstream detection. Biodiversity assessment in low-discharge streams should be less influenced by eDNA transport than in high-discharge streams. We estimated α- and β-diversity of the fish community from eDNA samples collected in a small Michigan (USA) stream from its headwaters to its confluence with a larger river. We found that α-diversity increased from upstream to downstream and, as predicted, we found a significant positive correlation between β-diversity and physical distance (stream length) between locations indicating species turnover along the longitudinal stream gradient. Sample replicates and different genetic markers showed similar species composition, supporting the consistency of the eDNA metabarcoding approach to estimate α- and β-diversity of fishes in low-discharge streams.
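As a quick illustration of the two quantities in this abstract, here is a small sketch that uses species richness as α-diversity and Jaccard dissimilarity between two sites as β-diversity. The abstract does not say which indices the authors used, so both choices are assumptions, and the site lists are invented.

```python
def alpha_diversity(species):
    """Species richness: the number of distinct species detected at a site."""
    return len(set(species))


def jaccard_dissimilarity(site_a, site_b):
    """1 - |shared species| / |all species| for two sites (0 = identical)."""
    a, b = set(site_a), set(site_b)
    union = a | b
    if not union:
        return 0.0  # two empty sites are treated as identical
    return 1 - len(a & b) / len(union)


# Hypothetical detections along a stream, from headwaters to confluence.
upstream = ["brook trout", "creek chub"]
midstream = ["brook trout", "creek chub", "white sucker"]
downstream = ["creek chub", "white sucker", "rock bass", "bluegill"]
```

In this toy example richness rises from 2 to 4 going downstream, and the dissimilarity between the upstream and downstream sites (0.8) is larger than between the neighbouring upstream and midstream sites (about 0.33), which is the kind of distance pattern the study reports.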

The use of environmental DNA (eDNA) has become an applicable non-invasive tool with which to obtain information about biodiversity. A sub-discipline of eDNA is iDNA (invertebrate-derived DNA), where genetic material ingested by invertebrates is used to characterise the biodiversity of the species that served as hosts. While promising, these techniques are still in their infancy, as they have only been explored on limited numbers of samples from only a single or a few different locations. In this study, we investigate the suitability of iDNA extracted from more than 3,000 haematophagous terrestrial leeches as a tool for detecting a wide range of terrestrial vertebrates across five different geographical regions on three different continents. These regions cover almost the full geographical range of haematophagous terrestrial leeches, thus representing all parts of the world where this method might apply. We identify host taxa through metabarcoding coupled with high-throughput sequencing on Illumina and IonTorrent sequencing platforms to decrease economic costs and workload and thereby make the approach attractive for practitioners in conservation management. We identified hosts in four different taxonomic vertebrate classes: mammals, birds, reptiles, and amphibians, belonging to at least 42 different taxonomic families. We find that vertebrate blood ingested by haematophagous terrestrial leeches throughout their distribution is a viable source of DNA with which to examine a wide range of vertebrates. Thus, this study provides encouraging support for the potential of haematophagous terrestrial leeches as a tool for detecting and monitoring terrestrial vertebrate biodiversity.

Advances in DNA sequencing technology have revolutionised the field of molecular analysis of trophic interactions and it is now possible to recover counts of food DNA sequences from a wide range of dietary samples. But what do these counts mean? To obtain an accurate estimate of a consumer's diet should we work strictly with datasets summarising frequency of occurrence of different food taxa, or is it possible to use relative number of sequences? Both approaches are applied to obtain semi-quantitative diet summaries, but occurrence data is often promoted as a more conservative and reliable option due to taxa-specific biases in recovery of sequences. We explore representative dietary metabarcoding datasets and point out that diet summaries based on occurrence data often overestimate the importance of food consumed in small quantities (potentially including low-level contaminants) and are sensitive to the count threshold used to define an occurrence. Our simulations indicate that using relative read abundance (RRA) information often provide a more accurate view of population-level diet even with moderate recovery biases incorporated; however, RRA summaries are sensitive to recovery biases impacting common diet taxa. Both approaches are more accurate when the mean number of food taxa in samples is small. The ideas presented here highlight the need to consider all sources of bias and to justify the methods used to interpret count data in dietary metabarcoding studies. We encourage researchers to continue addressing methodological challenges, and acknowledge unanswered questions to help spur future investigations in this rapidly developing area of research.
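The difference between the two summaries is easiest to see on a toy dataset. The sketch below computes frequency of occurrence (with a read-count threshold) and relative read abundance (RRA, averaged over samples) from per-sample read counts. The function names, the `min_reads` parameter and the sample counts are hypothetical and only meant to illustrate the two calculations.

```python
def frequency_of_occurrence(samples, min_reads=1):
    """Share of samples in which each taxon has at least min_reads reads."""
    taxa = sorted({taxon for sample in samples for taxon in sample})
    return {
        taxon: sum(1 for s in samples if s.get(taxon, 0) >= min_reads) / len(samples)
        for taxon in taxa
    }


def relative_read_abundance(samples):
    """Mean, across samples, of each taxon's proportion of reads in a sample."""
    taxa = sorted({taxon for sample in samples for taxon in sample})
    rra = {taxon: 0.0 for taxon in taxa}
    for sample in samples:
        total = sum(sample.values())
        if total == 0:
            continue  # a sample without reads contributes nothing
        for taxon, reads in sample.items():
            rra[taxon] += reads / total / len(samples)
    return rra


# Hypothetical read counts per food taxon in three dietary samples.
samples = [
    {"fish": 980, "squid": 20},
    {"fish": 950, "squid": 50},
    {"fish": 1000},
]
```

Here squid shows up in two of three samples (occurrence of about 0.67) although it accounts for only about 2% of the reads, and raising `min_reads` to 30 halves its occurrence. This is the sensitivity to small quantities and to the count threshold that the abstract points out.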

DNA metabarcoding is a rapidly growing technique for obtaining detailed dietary information. Current metabarcoding methods for herbivory, using a single locus, can lack taxonomic resolution for some applications. We present novel primers for the second internal transcribed spacer of nuclear ribosomal DNA (ITS2) designed for dietary studies in Mauritius and the UK, which have the potential to give unrivalled taxonomic coverage and resolution from a short-amplicon barcode. In silico testing used three databases of plant ITS2 sequences from UK and Mauritian floras (native and introduced) totalling 6561 sequences from 1790 species across 174 families. Our primers were well-matched in silico to 88% of species, providing taxonomic resolution of 86.1%, 99.4% and 99.9% at the species, genus and family levels, respectively. In vitro, the primers amplified 99% of Mauritian (n = 169) and 100% of UK (n = 33) species, and co-amplified multiple plant species from degraded faecal DNA from reptiles and birds in two case studies. For the ITS2 region, we advocate taxonomic assignment based on best sequence match instead of a clustering approach. With short amplicons of 187-387 bp, these primers are suitable for metabarcoding plant DNA from faecal samples, across a broad geographic range, whilst delivering unparalleled taxonomic resolution.

The implementation of HTS (high-throughput sequencing) approaches is rapidly changing our understanding of the lichen symbiosis, by uncovering high bacterial and fungal diversity, which is often host-specific. Recently, HTS methods revealed the presence of multiple photobionts inside a single thallus in several lichen species. This differs from Sanger technology, which typically yields a single, unambiguous algal sequence per individual. Here we compared HTS and Sanger methods for estimating the diversity of green algal symbionts within lichen thalli using 240 lichen individuals belonging to two species of lichen-forming fungi. According to HTS data, Sanger technology consistently yielded the most abundant photobiont sequence in the sample. However, if the second most abundant photobiont exceeded 30% of the total HTS reads in a sample, Sanger sequencing generally failed. Our results suggest that most lichen individuals in the two analyzed species, Lasallia hispanica and L. pustulata, indeed contain a single, predominant green algal photobiont. We conclude that Sanger sequencing is a valid approach to detect the dominant photobionts in lichen individuals and populations. We discuss which research areas in lichen ecology and evolution will continue to benefit from Sanger sequencing, and which areas will profit from HTS approaches to assessing symbiont diversity.

Thursday, June 7, 2018

Who owns ocean biodiversity?

Within national jurisdiction, the Nagoya Protocol protects countries from exploitative bioprospecting, and is meant to foster greater equity. But there's a huge missing piece, because two-thirds of the ocean exists beyond national jurisdiction. That's roughly half the Earth's surface with no regulations on accessing or using genetic resources.

Marine organisms have evolved to thrive in various ocean environments, resulting in unique adaptations that make them the object of commercial interest, particularly for biomedical and industrial applications. Researchers from the Stockholm Resilience Centre and the University of British Columbia have now identified 862 marine species with a total of 12,998 genetic sequences that are associated with a patent. They found that a single transnational corporation (BASF, the world's largest chemical manufacturer) has registered 47% of these sequences. Public and private universities accounted for another 12%, while entities such as governmental bodies, individuals, hospitals, and nonprofit research institutes registered the remaining 4%. Overall, entities located in only 10 countries accounted for 98% of the patents.

A considerable portion of all patent sequences (11%) are derived from species associated with deep sea and hydrothermal vent ecosystems (91 species, 1650 sequences), many of which are found in unregulated areas beyond national jurisdiction.

Establishing a legal framework for marine genetic resources will be a core issue when international negotiations on a new UN treaty on the conservation and sustainable use of biodiversity in areas beyond national jurisdiction (BBNJ) begin in earnest in September 2018. By 2025, the global market for marine biotechnology is expected to reach $6.4 billion and span a broad range of commercial purposes for pharmaceutical, biofuel, and chemical industries. It is clear that these industry leaders must be involved in the upcoming BBNJ treaty negotiations, if only by virtue of their ownership of such a large share of the marine genetic sequence patents.

Wednesday, June 6, 2018

Deep learning to identify and count wild animals

This technology lets us accurately, unobtrusively and inexpensively collect wildlife data, which could help catalyze the transformation of many fields of ecology, wildlife biology, zoology, conservation biology and animal behavior into 'big data' sciences. This will dramatically improve our ability to both study and conserve wildlife and precious ecosystems.

Motion-sensor 'camera traps' unobtrusively take pictures of animals in their natural environment, oftentimes yielding images not otherwise observable. The information in these photographs is only useful once it has been converted into numerical data. For years, the best method for extracting such information was to involve crowdsourced teams of human volunteers to label each image manually.

A team of researchers from the US and the UK has developed a system to automatically extract such information from images by using deep neural networks. The result is a system that can automate animal identification for up to 99.3 percent of images while still performing at the same 96.6 percent accuracy rate as crowdsourced teams of human volunteers. Deep neural networks are artificial neural networks with multiple hidden layers between the input and output layers. They require vast amounts of training data to work well, and the data must be accurately labeled (e.g., each image being correctly tagged with which species of animal is present, how many there are, etc.). For this study such data was available through Snapshot Serengeti, a citizen science project. Snapshot Serengeti has deployed a large number of camera traps in Tanzania that collect millions of images of animals in their natural habitat, such as lions, leopards, cheetahs and elephants. For this study, 3.2 million labeled images tagged by more than 50,000 human volunteers over several years were used as the training set.

Not only does the artificial intelligence system tell you which of 48 different species of animal is present, but it also tells you how many there are and what they are doing. It will tell you if they are eating, sleeping, if babies are present, etc. We estimate that the deep learning technology pipeline we describe would save more than eight years of human labeling effort for each additional 3 million images. That is a lot of valuable volunteer time that can be redeployed to help other projects.