The query sequences used in this experiment were several sequences from the Michigan State University Arabidopsis thaliana Sequencing project and the complete Visna lentivirus sequence from GenBank (Accession Number M10608). The nucleotide database, GenBank Release 83, contained 179,578 sequences with 194,040,877 total bases. The protein database, which combined PIR Release 40 and GenPept Release 83, contained 124,955 sequences with 34,935,042 total residues.
We chose Arabidopsis thaliana as one of our test cases because it is one of the most important plant models for genetic sequencing, and because, as the computer science research arm of the Michigan State sequencing project, we have 20,000 sequences to choose from [13]. The study of the performance of multiprocessors on the BLAST algorithm was originally inspired by the analysis needs of this sequencing project. We chose Visna lentivirus as our other test case because it differs significantly from the plant sequences used in this study and is a biologically interesting close relative of the Human Immunodeficiency Virus. We expect Visna to produce a large number of hits against the database, because HIV and other immunodeficiency viruses are being studied closely.
The databases were loaded into shared memory before each test. All three SMP architectures were able to hold the entire database in memory. To gain confidence in the results, each run was repeated at least four times, and some runs were repeated as many as ten times. Tests were done during dedicated time, with no other users on the computers or their networks. Measurements on the sc2000 were taken in single-user mode, while measurements on the CS6400 were taken in multi-user mode. On the SGI, measurements were taken twice, first in single-user mode and then in multi-user mode, to test whether single-user mode affected the results. The two sets of measurements showed no significant difference.
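The repetition procedure above can be illustrated with a minimal sketch. It is not the measurement harness used in this study; the function names `time_repeated` and `summarize`, and the choice of mean, sample standard deviation, and coefficient of variation as summary statistics, are assumptions made only for illustration.

```python
import statistics
import time


def time_repeated(task, repeats=4):
    """Run `task` `repeats` times and return the wall-clock time of each run.

    `task` is a hypothetical zero-argument callable standing in for one search.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Summarize repeated timings: run count, mean, sample stdev, and stdev/mean."""
    mean = statistics.mean(timings)
    # The sample standard deviation needs at least two runs.
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    cv = stdev / mean if mean else 0.0
    return {"runs": len(timings), "mean": mean, "stdev": stdev, "cv": cv}


if __name__ == "__main__":
    # Placeholder workload, not a BLAST search.
    timings = time_repeated(lambda: sum(range(100_000)), repeats=4)
    print(summarize(timings))
```

A small coefficient of variation across the repeats would be one way to judge that a timing is stable enough to report.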
In this experiment, one set of tests was conducted to check the content independence hypothesis, and another set was conducted to check the length--run time hypothesis. In the content test, five different sequences, each 400 bases long, were used. For the length test, the Visna sequence was trimmed to several lengths (250, 500, and 2500 bases, plus the entire 9202 bases). We used the Visna sequence in the length test because it was long enough to span the range of lengths we wanted to test. We ran the length and content tests with BLASTN, BLASTP, and BLASTX on four different architectures: the three SMP machines above and a Sparc 10 workstation. In all tests, only a single processor was used at a time.
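The construction of the length-test queries can be sketched as follows. This is an illustration only: it assumes that each shorter query is a prefix of the full sequence, which the text does not state, and it uses a randomly generated placeholder in place of the actual Visna sequence. The name `trim_query` is hypothetical.

```python
import random

# Query lengths used in the length test, in bases.
LENGTHS = [250, 500, 2500, 9202]


def trim_query(sequence, lengths):
    """Return a dict mapping each requested length to a prefix of `sequence`.

    Taking prefixes is an assumption; any contiguous region of the right
    length would serve the same purpose.
    """
    queries = {}
    for n in lengths:
        if n > len(sequence):
            raise ValueError(f"requested length {n} exceeds sequence length {len(sequence)}")
        queries[n] = sequence[:n]
    return queries


# Placeholder 9202-base sequence; NOT the real Visna sequence (M10608).
rng = random.Random(0)
placeholder = "".join(rng.choice("ACGT") for _ in range(9202))

queries = trim_query(placeholder, LENGTHS)

if __name__ == "__main__":
    for n in LENGTHS:
        print(n, len(queries[n]))
```

Because every query is drawn from the same sequence, differences in run time between queries can be attributed to length rather than to a change of source organism.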