4 Case studies of Alignment Viewer




We now discuss an AV visualization of the Human Immunodeficiency Virus (HIV), and a visualization of a sequence from a well-studied plant called Arabidopsis thaliana, commonly known as mustard weed. These case studies have been chosen because they are of interest to the molecular biologists in our research group and because they illustrate features of AV.

We took a section of the HIV sequence and ran the BLAST algorithm against GenBank and the PIR-International Protein Sequence Database. The sequence (GenBank Sequence K02012) has 5362 bases, which translates to about 1787 residues. The BLAST textual report is roughly 3200 pages long and contains a total of 6692 alignments to the GenBank database. Each alignment in the text report looks like the one shown on the right-hand side of Figure 5, where the text report has been parsed and converted to hypertext by our analysis engine [24]. After we filtered out all alignments to HIV itself, 1867 alignments remained, which is still roughly 800 pages of printed text. Analyzing this much data as a textual report is impractical.
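The arithmetic and the filtering step above can be sketched as follows. This is an illustrative sketch only, not the code of our analysis engine: the `Alignment` record, `residue_count`, and `drop_self_hits` are hypothetical names, and recognizing self-hits by a keyword in the description line is an assumption about how such a filter could work.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one alignment parsed from a BLAST report."""
    subject: str  # description line of the database sequence
    frame: int    # reading frame of the query: +1..+3 or -1..-3
    start: int    # first aligned query residue (1-based)
    end: int      # last aligned query residue (inclusive)
    score: int


def residue_count(n_bases: int) -> int:
    """Number of complete codons (residues) in a sequence of n_bases bases."""
    return n_bases // 3


def drop_self_hits(alignments, keyword):
    """Keep only alignments whose description does not mention keyword.

    Assumption: self-hits can be recognized by a case-insensitive
    substring match on the description line.
    """
    kw = keyword.lower()
    return [a for a in alignments if kw not in a.subject.lower()]


# Made-up sample data, for illustration only.
sample = [
    Alignment("HIV-1 gag polyprotein", 3, 1, 500, 900),
    Alignment("SIV gag polyprotein", 3, 10, 480, 400),
]
kept = drop_self_hits(sample, "HIV")
```

With the sequence length given above, `residue_count(5362)` yields the 1787 residues quoted in the text.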

Figure 2 is the Alignment Viewer output for the same report. The graphical view condenses 800 pages of text into one screen of information. The left-hand side is a 3D view, while the right-hand side is a 2D projection. The positions and relative lengths of the alignments provide a quick summary of where alignments are located along the query sequence. By rotating the 3D figure, the user can immediately see that there are no negative frame alignments, since there are no white lines in the fourth layer. This example shows that using a single color for all negative frame alignments allows the user to immediately determine whether the query sequence is positive or negative.

The color guide on the right-hand side of the picture shows the color assigned to each frame. We see a large region of green/yellow +3 frame alignments toward the front of the sequence, from about residue 1 to 500, and a second large region of red/blue +2 frame alignments from about residue 500 to 1450.

Sometimes a single DNA sequence encodes proteins in two slightly overlapping frames. The complicated mechanism that makes this possible is called a frame shift [27]. Biologists are interested in frame shifts because they are important to the function of the sequence and difficult to discover.

In this example, the first green/yellow region seems to encode one protein in one frame, and the second red/blue region seems to encode a different protein in another frame. The color change suggests the occurrence of a frame shift. Indeed, in HIV, the so-called gag protein in the first region does overlap with the pol protein in the second region [27]. Detecting the frame shift in the text report would be time-consuming because it would require looking through 800 pages of text. AV's use of different colors for different frames makes this phenomenon stand out immediately.
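The visual cue described above, a change in the dominant frame color along the query, can also be expressed computationally. The following is a minimal sketch under stated assumptions, not part of AV: alignments are given as hypothetical `(start, end, frame)` tuples in 1-based inclusive residue coordinates, the query is cut into fixed windows, and each window votes for the frame that covers the most residues. The sample data are made up to resemble the regions in Figure 2.

```python
from collections import Counter


def dominant_frames(alignments, query_len, window=100):
    """Return (window_start, dominant_frame) for each window of the query.

    dominant_frame is None when no alignment overlaps the window.
    """
    result = []
    for w_start in range(1, query_len + 1, window):
        w_end = min(w_start + window - 1, query_len)
        votes = Counter()
        for start, end, frame in alignments:
            # Number of residues of this alignment that fall in the window.
            overlap = min(end, w_end) - max(start, w_start) + 1
            if overlap > 0:
                votes[frame] += overlap
        dominant = votes.most_common(1)[0][0] if votes else None
        result.append((w_start, dominant))
    return result


def frame_changes(windows):
    """List (window_start, old_frame, new_frame) wherever the frame changes."""
    changes = []
    prev = None
    for w_start, frame in windows:
        if frame is None:
            continue  # skip windows with no alignments
        if prev is not None and frame != prev:
            changes.append((w_start, prev, frame))
        prev = frame
    return changes


# Made-up alignments: +3 frame near the front, +2 frame afterwards.
sample = [(1, 500, 3), (20, 480, 3), (500, 1450, 2), (520, 1400, 2)]
changes = frame_changes(dominant_frames(sample, query_len=1787, window=100))
```

A reported change is only a candidate; as in the visual analysis, confirming a frame shift still requires biological knowledge of the sequence.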

In Figure 3, we used the zooming feature in AV to zoom in on the gag protein region. As can be seen in this figure, the two regions labeled A and B correspond to large concentrations of positive residue pair scores. Remember that a positive residue pair score corresponds to a strong match. Biologists are interested in the identification of conserved regions in sequences. Conserved regions are regions of a sequence that have been preserved over evolution, and are less likely to change due to mutation and selection. A region that is conserved through evolution may play an important role in the function of the sequence. The large concentration of positive scores suggests that the two regions labeled A and B are conserved.

We also see in Figure 3 a region labeled C where negative residue pair scores are abundant. These negative pair scores suggest a region in which residues are weakly conserved, since there are more differences between the query sequence and the database sequences. Most of these alignments turn out to be alignments to the gag protein of the Simian Immunodeficiency Virus (SIV), a distant relative of HIV that infects monkeys. Thus, region C is likely to correspond to a segment where HIV and SIV are different biologically. Since SIV cannot infect humans, biologists are very interested in such differences.
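The idea of a "large concentration of positive residue pair scores" can be made concrete with a simple per-position summary. The sketch below is an assumption-laden illustration rather than AV's method: `column_scores` is a hypothetical list holding, for each query position, the residue pair scores of all alignments covering that position, and the threshold and minimum run length are arbitrary parameters.

```python
def positive_fraction(column_scores):
    """Fraction of positive residue pair scores at each query position."""
    fractions = []
    for scores in column_scores:
        if scores:
            fractions.append(sum(1 for s in scores if s > 0) / len(scores))
        else:
            fractions.append(0.0)  # no alignment covers this position
    return fractions


def conserved_runs(fractions, threshold=0.7, min_len=3):
    """Return (first, last) index pairs of runs where fraction >= threshold."""
    runs = []
    run_start = None
    for i, f in enumerate(fractions):
        if f >= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                runs.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(fractions) - run_start >= min_len:
        runs.append((run_start, len(fractions) - 1))
    return runs


# Made-up scores: two well-matched stretches separated by a weak one.
columns = [[2, 3], [4, 1], [5, 2], [-1, -2], [-3, 1], [2, 2], [3, 1], [1, 4]]
runs = conserved_runs(positive_fraction(columns))
```

Runs above the threshold play the role of regions A and B, while stretches dominated by negative scores correspond to a region like C.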

The ability to interactively zoom into the relevant region was important in the identification of these conserved regions. Zooming away from the image makes it easy to detect broad similarity features. Zooming in makes details clearer and simplifies the identification of regions A and B as local regions of high similarity to other sequences.

As described above, our representation emphasizes conserved regions. In scanning the Arabidopsis sequences our research group has analyzed using AV, we found many sequences with possible conserved regions. In Figure 4, we show an example of one of these sequences. The two bands of red positive residue pair scores labeled A and B are quite obvious in this example. The first region, labeled A, is apparent from almost every angle, whereas the second region, labeled B, is more visible when viewed straight down the Z-axis. The ability to rotate and look at the visualization from different perspectives was crucial to the identification of this region. The alignments in the A and B regions of this report are to a variety of binding proteins and ribonucleoproteins, which suggests that the conserved regions A and B might share a common function with these other database sequences.

In AV, we use a fat line approach to ensure rapid real-time rotation and zooming. During a mouse-driven user interaction, such as rotation or zooming, each alignment is reduced to a single fat line; when the user releases the mouse button, the full-feature visualization is displayed. The number of lines drawn in a frame during a single rotation in the example from Figure 2 is 1857 using fat lines, compared to 123229 for the full image. This technique greatly reduces the elapsed time between frames during an interaction, so the user receives more frequent feedback.
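The fat line approach is a form of level-of-detail rendering, and its structure can be sketched as follows. This is a simplified illustration, not AV's rendering code: alignments are hypothetical `(start, end, layer, pair_scores)` tuples, and the full-detail view is assumed to draw one short segment per residue pair. With the figures quoted above, the fat line view draws roughly 66 times fewer lines (123229 / 1857).

```python
def fat_line(alignment):
    """Reduced representation: one thick segment for the whole alignment."""
    start, end, layer, _pair_scores = alignment
    return [(start, end, layer, "fat")]


def full_detail(alignment):
    """Full representation: one segment per residue pair (an assumption)."""
    start, _end, layer, pair_scores = alignment
    return [
        (start + i, start + i + 1, layer, "pos" if s > 0 else "neg")
        for i, s in enumerate(pair_scores)
    ]


def segments_for_frame(alignments, interacting):
    """Choose the representation based on whether the mouse is being dragged."""
    build = fat_line if interacting else full_detail
    return [seg for a in alignments for seg in build(a)]


# Made-up alignments for illustration.
sample = [(10, 14, 1, [3, -1, 2, 4]), (30, 33, 2, [1, 1, -2])]
during_drag = segments_for_frame(sample, interacting=True)
after_release = segments_for_frame(sample, interacting=False)
```

The design choice is to trade detail for frame rate only while the view is changing; the complete picture is restored as soon as the interaction ends.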

Sometimes detailed information about a particular alignment is needed. Biologists must then consult the original textual report. AV projects the three-dimensional representation onto the XY-plane as shown in Figure 2. The user can translate, zoom, and rotate both the 3D and 2D representations. The 2D projection was created to allow direct interaction with a single alignment. When a user clicks on a single alignment, the hypertext document browser showing the text report [24] jumps to the correct place and displays the detailed information about that alignment (see the right half of Figure 5). The hyperlinks from the graphical output to the textual output ensure that all information is accessible to the user. Finding all the information related to a particular alignment becomes easier because AV provides a visual index to all of the alignments in a report, whereas in the past users had to search for a particular alignment through pages of text.
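The link from a click in the 2D projection to a place in the hypertext report can be sketched as a picking step followed by the construction of a link. Everything in this sketch is hypothetical: we assume each alignment appears in the projection as a horizontal line `(align_id, x_start, x_end, y)`, and that the hypertext report contains one anchor per alignment named `align<ID>`. The actual picking method and anchor naming used by AV and the analysis engine are not described here.

```python
def pick_alignment(click_x, click_y, lines, tolerance=2.0):
    """Return the id of the line closest to the click, or None if none is near.

    A line is a candidate when the click lies within its horizontal extent
    and within `tolerance` units of it vertically.
    """
    best_id = None
    best_dist = tolerance
    for align_id, x_start, x_end, y in lines:
        if x_start <= click_x <= x_end:
            dist = abs(click_y - y)
            if dist <= best_dist:
                best_id, best_dist = align_id, dist
    return best_id


def report_link(report_url, align_id):
    """Build a link to the (assumed) anchor of one alignment in the report."""
    return f"{report_url}#align{align_id}"


# Made-up projected lines: (id, x_start, x_end, y).
lines = [(1, 0.0, 500.0, 120.0), (2, 500.0, 1450.0, 300.0)]
picked = pick_alignment(250.0, 121.0, lines)
link = report_link("report.html", picked) if picked is not None else None
```

The tolerance keeps clicks on empty space from selecting a distant alignment, which matters when many lines are packed into one screen.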

When the user chooses the alignment, a curve plot also appears on the screen as shown in Figure 5. The plot shows the different similarity scores from different substitution matrices. We call this the alignment matrix curve.

Different substitution matrices allow different degrees of mismatches and mutations. These matrices are either experimentally or theoretically derived. The Dayhoff matrices are called PAM (Point Accepted Mutation) matrices [8]. One PAM unit corresponds to one accepted point mutation per 100 residues, so a PAM distance is a rough measure of how much evolutionary change separates two sequences. Thus, the PAM scale corresponds roughly with different evolutionary distances. On the PAM scale, a low number signifies a close evolutionary distance. For example, the PAM120 matrix allows fewer mutations than PAM250.

In the BLAST result for HIV in Figure 2, the PAM250 matrix was used. This means that the similarity algorithm was most sensitive to alignments near a PAM distance of 250. Biologists do not always know which matrix to use for a particular sequence.

In the matrix curves of Figure 5, the similarity scores were recomputed and normalized statistically using different matrices [16]. The X-axis on the curve plot is the evolutionary distance measured in PAM, and the Y-axis is the renormalized score computed using a particular PAM matrix. The alignment matrix curve for different alignments peaks at different distances. For example, the alignment labeled A peaks at PAM60, whereas the alignment labeled B peaks at PAM120. These peaks provide estimates of the evolutionary distances of the alignments. This curve was added to AV to help biologists determine which substitution matrix to use.
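Reading the peak of an alignment matrix curve amounts to finding the PAM distance with the highest renormalized score. The sketch below assumes each curve is available as a hypothetical mapping from PAM distance to renormalized score; the numbers are made up so that the two curves peak where alignments A and B do in Figure 5, and the renormalization itself [16] is not reproduced.

```python
from collections import Counter


def curve_peak(curve):
    """Return the PAM distance at which the renormalized score is highest."""
    return max(curve, key=curve.get)


def peak_histogram(curves):
    """Count how many alignment matrix curves peak at each PAM distance."""
    return Counter(curve_peak(c) for c in curves.values())


# Made-up curves: PAM distance -> renormalized score.
curves = {
    "A": {30: 40.0, 60: 55.0, 120: 48.0, 250: 30.0},
    "B": {30: 20.0, 60: 35.0, 120: 44.0, 250: 38.0},
}
peaks = {name: curve_peak(c) for name, c in curves.items()}
histogram = peak_histogram(curves)
```

A histogram of peaks over all alignments in a report gives a quick indication of which PAM distance, and hence which substitution matrix, is likely to suit most of them.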

The alignment matrix curves from the HIV PAM250 report peaked mostly around 40-120 PAM. Since PAM250 is most sensitive for alignments at 250 PAM, and less sensitive for alignments at 60 PAM, these peaks suggest that the initial choice of PAM250 is too insensitive for most of the alignments in this report. After looking at these matrix curves and their peaks, we decided to run BLAST with a PAM matrix at a closer evolutionary distance that allows fewer mutations. In Figure 6, the PAM60 matrix was used. The number of alignments decreased noticeably, from 1867 to 1516. The major features were retained, but the number of low-scoring alignments near the bottom of the plot (score 50) decreased significantly.

The PAM60 textual output is also over 800 pages long. Comparing the two outputs produced with different matrices is not feasible using the textual outputs alone; it becomes possible only with this graphical representation. The visualization overviews provided by AV concisely summarize the textual data, and this makes the above comparison possible.

Our research group has a database of more than 15,000 sequences and their analysis reports. Instantaneous access to all of the data is as critical in the analysis as the data interpretation. Each hypertext report has a static AV plot that is rotated to a fixed position. The static image then serves as a hyperlinked icon to the actual AV input file. Since AV is integrated with the World Wide Web, any user with AV installed on her workstation can click on the image icon and interact with the AV visualization directly.

In this section, we showed several examples of the use of AV. In particular, we showed how to identify a possible frame shift by looking for a color change in the graphical output of AV's representation. We also showed how to find possible conserved regions by scanning for large concentrations of positive colors that correspond to strong matches. AV allows the user to examine the data closely using zoom, translate, and rotate. These features also helped in the discovery of conserved regions. The visualization is hyperlinked with the textual report to provide the most detailed information. AV also provides alignment matrix curves to estimate the evolutionary distance of an alignment between two sequences. This additional piece of information can be used to estimate which substitution matrix to use next. These features of AV have proven to be useful in the analysis of sequence similarity.






Ed H. Chi (echi@cs.umn.edu)
Fri Apr 28 12:51:35 CDT 1995

The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.