The field of molecular biology is rooted in the desire to determine the genes within the cells of organisms, the function of the proteins that these genes encode, and how these proteins are related evolutionarily across organisms. Knowing the function of a protein helps explain the biochemistry of the organism. Understanding evolutionary relationships between proteins aids in identifying the fundamental similarities and differences between organisms.
Genes, composed of DNA, are represented as discrete sequences of nucleic acids, also called bases. Proteins are represented as discrete sequences of amino acids, also called residues. Genes and proteins are typically represented as sequences of letters. Among biological organisms, genes and proteins are related through evolution, and share common functions. As molecular biologists discover new genes and the function of corresponding proteins, the information is being cataloged in the form of nucleic acid sequence databases for genes, and amino acid sequence databases for proteins.
Molecular biologists who conduct large-scale genetic sequencing projects are producing an ever-increasing amount of sequence data through the use of automated DNA sequencing machines. GenBank, the primary repository for DNA sequence data, contains roughly 499,000,000 nucleotides in 744,000 sequences as of April 1996, and is doubling every 1.3 years [5,2]. The rate of increase will further accelerate as large projects such as the Human Genome Project [11] and the Arabidopsis thaliana Genome Project gain more momentum [14]. Keeping pace with the analysis of this data is a difficult task for biologists.
One of the most successful techniques for analyzing genetic data is through sequence similarity analysis using sequence similarity algorithms---the comparison of a single sequence against the databases of known sequences. Sequence similarity algorithms are a well-developed aspect of computational molecular biology research and employ dynamic programming and heuristic search techniques. These algorithms identify similar regions (also called alignments) between sequences. These alignments provides clues to possible protein functions for the unknown input sequences, reducing the need for painstaking lab work [6]. The information that a computer provides in a few hours would otherwise take months of lab work.
Ironically, as more information about sequences becomes available, reducing the need for laborious laboratory work, the task of analyzing the sequences on computers becomes increasingly difficult. BLAST is one of the most popular similarity search algorithms in use today, but its running time is proportional to the size of the database. Similarity analysis using BLAST is becoming a bottleneck in genetic sequence analysis.
Two different performance goals arise during the analysis of sequence data. Often there are many sequences that must be compared against a database, either because these sequences are brand new and have just arrived from the laboratory, or because there has been a new release of the database. In these situations, the throughput, or number of sequences analyzed per unit time, is most important.
On the other hand, in day-to-day sequence analysis, biologists need to analyze individual sequences against the database. On a single processor of today's standards, a medium-size sequence can take over 10 minutes to process, while a long sequence can take nearly an hour. Moreover, since the run time of similarity algorithms depends on the size of the database, the amount of time to analyze a sequence using computers is increasing at an alarming rate. The biologist is reduced to a batch mode operation: she submits a sequence, leaves the computer running, and comes back later for results. Much faster run-times are needed, so biologists can use the computer interactively for exploratory analysis. For individual sequence analysis response time is much more important than throughput.
The current trend of improvements on processors is stated by the updated Moore's law, which says processor speed is doubling roughly every 1.5 years. However, databases are doubling in size every 1.3 years. Since sequence similarity algorithms take time proportional to database size, and since database size is growing faster than processor speed, we cannot wait for uniprocessor speed increases to solve our throughput and response time problems. A different approach is needed to speed sequence similarity search.
Shared-Memory Multiprocessors (SMPs) may offer performance that scales with the growth of the genetic sequence databases. Analyzing the performance of BLAST on these machines will improve our theoretical and practical understanding of the scalability of the algorithm, and will lead to criteria for the efficient application of SMPs to sequence similarity analysis.
The remainder of the paper is structured as follows. In the next section, we present related work, and in Section 3 we discuss in detail the research questions we hope to answer. In Section 4 we briefly mention the high-level design of our experiments. Section 5 and 6 contains the results and analysis of the experiments. Finally, Section 7 contains concluding remarks.
The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.