3 Alignment Viewer's Representation




Our visualizer is called Alignment Viewer (AV) and is a result of a collaborative effort between computer scientists and molecular biologists. This collaboration allowed us to determine the most important features of the output of similarity search algorithms. We designed the representation to enhance these features graphically.

First, AV depicts positional information. Each alignment is between a subsequence of the query sequence and a subsequence of a sequence from the database. The position and length of an alignment identify a region of similarity between the two sequences. On the X-axis, AV plots the absolute position in the query sequence. In Figure 1, the X-axis runs from the origin toward the viewer. An alignment is represented by a comb-like object, such as the ones shown in Figure 1. The beginning, end, and relative length of each object correspond to the beginning, end, and length of the alignment.

Second, AV depicts the frame number of each alignment. The frame number defines how the DNA sequence is translated into a protein sequence. DNA sequences are written in a four-letter alphabet (A, C, G, and T, one letter for each of the four nucleotides). Three DNA bases encode a single protein residue (also called an amino acid), so there are 4 x 4 x 4 = 64 possible three-base encodings. These 64 encodings represent the 20 fundamental residues with some redundancy. Protein sequences use a 20-letter alphabet for the 20 different residues.

A DNA sequence can encode a protein sequence starting from the first, second, or third position. Biologists do not always know which position the encoding starts from. The starting point determines how bases are grouped into residues. DNA sequences are double-stranded, with one strand named positive and the other negative. Biologists do not always know whether a particular sequence is from the positive or negative strand. Sequences from the negative strand must be translated into proteins in reverse (from right to left, as the sequence is usually written, using the complementary bases). Thus, there are six different ways of translating a DNA sequence into a protein sequence: start from the first, second, or third nucleotide of the positive strand, or start from the first, second, or third nucleotide of the negative strand. The six ways of translation are called the reading frames of the sequence, and are labeled with frame numbers +1, +2, +3, and -1, -2, -3, respectively. Consequently, when comparing a DNA sequence to a protein sequence, all six possibilities must be considered, and the frame number recorded.
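The six reading frames described above can be illustrated with a minimal Python sketch. It is not part of AV; the function names are hypothetical, the codon table is the standard genetic code (stop codons shown as `*`), and the negative frames are taken from the reverse complement of the input, which is the usual convention for "translating in reverse".

```python
# Standard genetic code, indexed by bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the negative strand, written in its own reading direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate complete codons starting at `offset` (0, 1, or 2).

    Codons containing anything other than A, C, G, T become 'X'.
    """
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map each frame number (+1..+3, -1..-3) to its protein translation."""
    dna = dna.upper()
    negative = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(negative, offset)
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCATTGTA").items()):
        print(f"{frame:+d}: {protein}")
```

A search program that compares a DNA query against protein sequences would, in effect, run its comparison on each of these six translations and record which frame produced each alignment.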

AV presents reading frame information using both colors and layers in the Z-direction, which runs left to right in Figure 1. AV puts all +1 frame alignments in the first layer, +2 frame alignments in the second layer, and +3 frame alignments in the third layer. The fourth layer is reserved for all negative frame alignments; these alignments appear only as white lines. The user can choose to swap the negative and positive frames so that the negative frames appear in the first three layers with color, and the positive frames appear in the fourth layer as white lines. The positive and negative frames are separated in this way because, most of the time, the biologist is interested in only positive frame alignments or only negative frame alignments. Figure 1 shows the different layers. The white line labeled C is a negative alignment in the fourth layer.

The frame number is also encoded by color. Color has a dual purpose, encoding both frame number and residue pair score (which is explained later). For this reason, each frame is coded with two colors. The +1 frame alignments are shown with red/blue coloring, +2 alignments with green/yellow, and +3 alignments with magenta/cyan.
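The static layer and color assignment can be summarized in a short sketch. The function name and the `reversed_view` flag are hypothetical, and the sketch assumes that in the swapped view frame -1 takes the red/blue pair, -2 green/yellow, and -3 magenta/cyan, which the text does not state explicitly.

```python
# Color pair per colored layer: (color for positive pair scores,
# color for negative pair scores).
FRAME_COLORS = {
    1: ("red", "blue"),
    2: ("green", "yellow"),
    3: ("magenta", "cyan"),
}


def layer_and_colors(frame, reversed_view=False):
    """Return (layer, color pair) for a frame number in +1..+3 or -1..-3.

    With reversed_view=False the positive frames occupy layers 1-3 in color
    and every negative frame goes to layer 4 in white; reversed_view=True
    swaps the roles (assumed mapping: -1 -> layer 1, and so on).
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError(f"invalid frame number: {frame}")
    colored_sign = -1 if reversed_view else 1
    if frame * colored_sign > 0:
        layer = abs(frame)
        return layer, FRAME_COLORS[layer]
    return 4, ("white", "white")


if __name__ == "__main__":
    for frame in (1, 2, 3, -1, -2, -3):
        print(frame, layer_and_colors(frame))
```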

We encoded the frame information using both layers and colors so that it is not lost when the scene is viewed from certain positions. For example, if only the Z-coordinate were used to distinguish frame numbers, the layer information would be lost when the scene is viewed down the Z-axis. On the other hand, if only colors were used, objects toward the front would tend to mask objects near the back when viewed down the X-axis. Separating the frames into layers also reduces the amount of information encoded in each layer.

AV can also assign the colors dynamically. In the dynamic color assignment mode, the frame with the highest score is assigned red/blue, the frame with the second highest score is assigned green/yellow, and the frame with the third highest score is assigned magenta/cyan. Dynamic color assignment thus becomes a crude way of guessing the correct reading frame for a given sequence. This can be useful when scanning large numbers of sequence reports, because the biologist can simply look for red regions.
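A minimal sketch of dynamic color assignment follows. The text does not say how a frame's score is derived from its alignments, so this sketch assumes it is the highest alignment score found in that frame; the function name and the input format of (frame, score) pairs are also hypothetical.

```python
# Color pairs in rank order: best-scoring frame first.
COLOR_PAIRS = [("red", "blue"), ("green", "yellow"), ("magenta", "cyan")]


def assign_colors_dynamically(alignments):
    """Rank frames by their best alignment score and hand out color pairs.

    `alignments` is an iterable of (frame, score) pairs restricted to the
    three frames shown in the colored layers. Assumption: a frame's score
    is the maximum score of any alignment in that frame.
    """
    best = {}
    for frame, score in alignments:
        best[frame] = max(score, best.get(frame, score))
    ranked = sorted(best, key=lambda frame: best[frame], reverse=True)
    # zip stops after three frames, so at most three color pairs are used.
    return dict(zip(ranked, COLOR_PAIRS))


if __name__ == "__main__":
    report = [(1, 40), (2, 95), (3, 60), (2, 30)]
    print(assign_colors_dynamically(report))
```

Under this assumption, the frame holding the single strongest alignment is always drawn in red/blue, which matches the idea of scanning reports for red regions.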

Third, AV depicts the similarity scores along the Y-axis (the up direction in Figure 1). The search algorithms produce statistical measures of similarity. These statistical techniques are built upon the substitution matrices pioneered by Dayhoff et al. [8]. Each entry in a substitution matrix measures the likelihood that one amino acid could replace another in a sequence as a result of genetic mutation and selection. Because some amino acids are functionally similar, some substitutions are biologically favored over others.

Similarity algorithms use substitution matrices to compute the similarity score of an alignment. For a given pair of residues in an alignment, an algorithm like BLAST looks up the entry in the matrix and gets the residue pair score. This residue pair score is then a measure of the strength of the match. To compute the actual score of the alignment, BLAST sums all the residue pair scores in the alignment. The location on the Y-axis of an alignment represented by a comb object corresponds to its similarity score. For example, in Figure 1, the alignment labeled B has a score of 95.
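The summation described above can be sketched as follows. This is a simplified illustration for an ungapped alignment of equal-length segments, not BLAST's actual implementation; the function names are hypothetical, and the small matrix holds illustrative values only rather than a complete published substitution matrix.

```python
# Toy substitution matrix (illustrative values only). Each unordered
# residue pair is stored once; the lookup treats the matrix as symmetric.
TOY_MATRIX = {
    ("W", "W"): 11,
    ("A", "A"): 4,
    ("L", "L"): 4,
    ("A", "S"): 1,
    ("L", "D"): -4,
}


def residue_pair_score(a, b, matrix):
    """Look up the score for aligning residue a with residue b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def residue_pair_scores(query, subject, matrix):
    """Score each aligned column of an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("ungapped alignment needs equal-length segments")
    return [residue_pair_score(a, b, matrix) for a, b in zip(query, subject)]


def alignment_score(query, subject, matrix):
    """Alignment score = sum of the residue pair scores."""
    return sum(residue_pair_scores(query, subject, matrix))


if __name__ == "__main__":
    print(residue_pair_scores("WAL", "WSD", TOY_MATRIX))
    print(alignment_score("WAL", "WSD", TOY_MATRIX))
```

In AV's representation, the sum would set the height of the comb on the Y-axis, while the individual pair scores would drive the teeth.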

If the residue pair score is positive, then the replacement of the residues is considered likely and represented by the positive colors of each frame (red, green, and magenta, respectively). If the residue pair score is negative, then the replacement is considered unlikely and the negative colors are used (blue, yellow, and cyan, respectively).

Each residue pair score is also encoded by the length of its tooth: the stronger the residue pair score, the longer the tooth. For example, the line labeled A in Figure 1 has a residue pair score of 17, and the line labeled B has a residue pair score of -4.
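The color and length encoding of a single tooth can be sketched as below. The function name and the `unit_length` scale factor are hypothetical; the sketch assumes tooth length is proportional to the absolute value of the residue pair score, and that a score of exactly zero (a case the text does not cover) takes the negative color with zero length.

```python
def tooth(frame_colors, pair_score, unit_length=1.0):
    """Return (color, length) for one tooth of a comb.

    `frame_colors` is the (positive color, negative color) pair of the
    alignment's frame, for example ("red", "blue") for frame +1.
    Assumption: length grows linearly with the magnitude of the score.
    """
    positive_color, negative_color = frame_colors
    color = positive_color if pair_score > 0 else negative_color
    return color, abs(pair_score) * unit_length


if __name__ == "__main__":
    frame_one = ("red", "blue")
    print(tooth(frame_one, 17))   # likely replacement: long tooth
    print(tooth(frame_one, -4))   # unlikely replacement: short tooth
```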

We implemented AV on Sun and SGI workstations running the X Window System. Commercial visualization engines might have served our needs, but using an existing package would still have involved a fair amount of development. We also wanted to integrate AV with the DNA sequence analysis database system we are developing [24]. Furthermore, we wanted to distribute our application widely to molecular biologists, who might not have easy access to commercial visualization packages.

In summary, AV represents the length and position of alignments, the frame numbers, the similarity scores, and the residue pair scores. These elements of the textual report are essential to the data analysis, and they demonstrate the multi-dimensional nature of sequence similarity data. By collaborating with biologists working on sequencing projects, we ensured that this representation is understandable and effective in the analysis of sequences.






Ed H. Chi (echi@cs.umn.edu)
Fri Apr 28 12:51:35 CDT 1995