A common starting point for the computational analysis of proteins is the construction of a multiple sequence alignment (MSA). Insofar as they result from protein functional similarities and differences, the patterns of residue conservation and divergence within such an alignment provide clues to biological function. Of course the biological relevance of any observed patterns depends upon an alignment’s accuracy, and alignments of larger sequence sets have greater statistical power. For biologically appropriate scoring systems applied to more than a very small number of sequences, however, no optimization procedures are known that are both tractable and rigorous; thus all practical MSA programs rely upon heuristic methods.
Most commonly used alignment tools compare sequences to one another and, at each branching step, rank the alternatives using the available information to decide which branch to follow. This is faster than an exhaustive search but still takes prohibitively long for sets of a hundred thousand or more related sequences. Moreover, a heuristic search is not guaranteed to find the best solution; it yields only an approximation.
Two researchers have now developed a new algorithm that is both faster and more accurate than these conventional tools. Instead of comparing sequences to each other, it compares each sequence to an evolving statistical model. This approach not only saves time but is also better at finding biologically relevant signals within such alignments. Their new program is called GISMO, an acronym for "Gibbs Sampler for Multi-Alignment Optimization." Gibbs sampling, a statistical technique for solving highly complex problems, is a central feature of the approach.
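To make the idea of comparing each sequence to an evolving statistical model more concrete, here is a minimal Python sketch of a classic ungapped Gibbs site sampler. It is not GISMO's algorithm, which handles full gapped alignments and is considerably more sophisticated; it only illustrates the general principle. The function names (build_profile, segment_weight, gibbs_align), the pseudocount value, and the uniform background distribution are all assumptions made for this illustration.

```python
import random
from collections import Counter

# The 20 standard amino acids; input sequences are assumed to use only these letters.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(segments, pseudocount=1.0):
    """Column-wise residue frequencies (with pseudocounts) for equal-length segments."""
    width = len(segments[0])
    total = len(segments) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(width):
        counts = Counter(seg[col] for seg in segments)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def segment_weight(segment, profile):
    """Likelihood of a segment under the profile relative to a uniform background."""
    weight = 1.0
    for col, residue in enumerate(segment):
        weight *= profile[col][residue] * len(ALPHABET)
    return weight


def gibbs_align(seqs, width, sweeps=200, seed=0):
    """Sample one start position per sequence for an ungapped block of the given width."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # The model is rebuilt from every sequence except the one being resampled,
            # so each sequence is compared to the model, never to another sequence.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            profile = build_profile(others)
            positions = list(range(len(seq) - width + 1))
            weights = [segment_weight(seq[p:p + width], profile) for p in positions]
            # Draw the new position at random in proportion to its weight,
            # rather than always taking the best-scoring one.
            starts[i] = rng.choices(positions, weights=weights, k=1)[0]
    return starts
```

Calling gibbs_align(seqs, width=5) on a list of protein strings returns one sampled start position per sequence. Because positions are drawn at random in proportion to their weight, the sampler can move away from a poor early choice, which a purely greedy branching procedure cannot do.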
At this point GISMO works only for protein alignments, and the authors are the first to point out that there is room for improvement. Because researchers have spent decades finding ways to speed up and improve conventional methods, and because GISMO takes such a new and different approach, the authors are confident that GISMO can be made even faster and more accurate going forward. The reason to pursue it is that, for large sequence sets, this approach offers clear advantages in alignment accuracy over the most popular programs currently available.