A few posts ago I wrote about a new publication that put DNA barcode sequences through some quality tests. Today the journal MycoKeys published a paper that compiles best practices for the quality management of sequence data.
The lead author of the paper, Henrik Nilsson, on the motivation for writing it: "Many researchers find sequence quality control difficult, though. There just isn't any straightforward document to put in their hands to give them a flying start. As a result, scientists differ in the degree to which they are aware of the need to exercise sequence quality control and in what measures they take."
The authors focus on ITS, the DNA barcode for fungi, but all suggestions and recommendations are broadly applicable to other markers and organisms. The paper is a collection of really good guidelines and is especially useful for beginners. This is one of the papers I would give any freshman student to read if I had any :-)
Here is a summary of their five guidelines (Table 1 of their publication).
| Target of guideline | Way of getting there |
|---|---|
| 1. Establish that the sequences come from the intended gene or marker | Do a multiple alignment of the sequences and verify that they all feature some suitable, conserved sub-region (here the 5.8S gene) |
| 2. Establish that all sequences are given in the correct (5’ to 3’) orientation | Examine the alignment for any sequences that do not align at all to the others; re-orient these; re-run the alignment step; and examine them again |
| 3. Establish that there are no (bad cases of) chimeras in the dataset | Run the sequences through BLAST in INSD/UNITE and verify that the best match comprises more or less the full length of the query sequences |
| 4. Establish that there are no other major technical errors in the sequences | Examine the BLAST results carefully, particularly the graphical overview and the pairwise alignment, for anomalies |
| 5. Establish that any taxonomic annotations given to the sequences make sense | Examine the BLAST hit list to see that the species names produced make sense |
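
Guidelines 2 and 3 lend themselves to a little scripting. Below is a minimal Python sketch, not anything taken from the paper: the function names are made up for illustration, and the 90% coverage cut-off is my own assumption for "more or less the full length". It re-orients a sequence by reverse-complementing it and flags a query whose best BLAST match covers too little of its length, given the match's start and end positions on the query.

```python
# Hypothetical helpers for illustration; names and the 0.9 threshold are
# assumptions, not values recommended by the paper.

# Translation table for the four unambiguous bases (case preserved).
# Ambiguity codes such as N are left unchanged by translate().
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Re-orient a DNA sequence by returning its reverse complement."""
    return seq.translate(_COMPLEMENT)[::-1]


def query_coverage(query_len: int, qstart: int, qend: int) -> float:
    """Fraction of the query covered by a match.

    qstart and qend are 1-based, inclusive positions on the query.
    """
    return (abs(qend - qstart) + 1) / query_len


def looks_chimeric(query_len: int, qstart: int, qend: int,
                   min_coverage: float = 0.9) -> bool:
    """Flag a query whose best match covers less than min_coverage of it."""
    return query_coverage(query_len, qstart, qend) < min_coverage


if __name__ == "__main__":
    print(reverse_complement("ATGCC"))   # GGCAT
    print(looks_chimeric(600, 1, 590))   # False: about 98% covered
    print(looks_chimeric(600, 1, 310))   # True: only about 52% covered
```

A flagged sequence is only a candidate for a closer look. Short coverage can have other causes, so the manual inspection of the BLAST output described in guideline 4 is still needed.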