Genetic information is usually expressed as a sequence of characters from an alphabet, so similarity algorithms were derived from the dynamic programming techniques that detect substrings within a string. An similarity algorithm compares two sequences with each other. To compare a sequence with a database of sequences, we simply apply the similarity algorithm repeatedly to each of the database sequences. The output of similarity algorithms are alignments between a input sequence and all database sequences. Each alignment describes a juxtaposition of certain characters in the database sequence with certain characters in the search sequence. However, finding biological similarities required extending the substring algorithms to handle multiple insertions, deletions, and substitutions within the substring.
Insertions, deletions and substitutions are assigned different costs,
based on our understanding of the likelihood of the biological events
that would lead to the corresponding change in the sequence. Handling
arbitrary costs for insertions, deletions, and substitutions slows the
search algorithms. By restricting the cost functions in certain ways,
it is possible to obtain algorithms that runs in roughly
,
where M is the length of the input sequence and N is the length of
the database. Early dynamic programming similarity algorithms with
variations on this scheme include Gotoh [7], Needleman and
Wunsch [10], and Smith and Waterman [16].
However, these early dynamic programming techniques are still too slow because the databases are so large. A further algorithmic improvement is needed. Computational biologists notice that most good alignments contain large sections of exact matches and substitutions, with very few insertions and deletions. By restricting the search to finding long regions of exact matches and substitutions, algorithm designers are able to accelerate the search further at the cost of some loss in accuracy. Moreover, we can compensate for the lost sensitivity by a post-processing phase where each long exact match is extended in both directions by considering insertions and deletions. BLAST [1] and FASTA [12] are examples of this approach.
There are two ways of locating long diagonals---scanning the entire database, or using an index to look for exact matches without scanning the entire database [3]. Most algorithms currently use the first approach because a complete index takes considerably more amount of memory than a flat file database, and the need for random access to the index slows down the execution time. For this reason, BLAST uses a linear search, modified to build a hit word list by considering all possible words of a certain fixed length and keeping ones that are likely to have biological significance.
Even with all of these improvements to similarity algorithms, the performance of similarity algorithms is a bottleneck for many applications. Performance limitations are of increasing importance because of the rapid growth in the size of sequence databases. One approach to improve performance is to use multiprocessors.
In recent years, computational biologists have employed message passing architectures for more computational intensive algorithms such as Needleman and Wunsch. They have obtained good results on commercially-available machines as well as on clusters of workstations [15]. Work has also been done using massively parallel computers [4,8].
Our work extends previous work by studying the performance of the most common sequence similarity algorithm---BLAST---on shared memory multiprocessors. We seek to identify both the fundamental limitations on the parallelism of the BLAST algorithm, as well as the practical limitations of SMP machines for such a memory-intensive application.
The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.