We first analyze a DNA sequence from the well-studied plant Arabidopsis thaliana (mustard weed). Figure 5 shows a 3D scatter plot of three measures of similarity for Arabidopsis sequence 10G8T7P. Percent identities, similarity score, and P-value are mapped to the X, Y, and Z axes, respectively. In general, as the similarity score increases along the Y-axis, we expect the percent identities and the P-value to increase as well. We therefore expect the alignments to fall mostly along the diagonal from the origin to the top right back corner.
Figure 5: 3D scatter plot of Arabidopsis sequence 10G8T7P. The X, Y, and Z axes are percent identities, score, and P-value, respectively.
As the score increases, percent identities do increase in general. A rotated view (not shown) also shows the P-value increasing as expected. However, two lines of points, one red and one green, extend to the right without a corresponding increase in score: these alignments have high percent identities but low scores.
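The "stray" alignments described above can also be picked out by a simple filter. The following is only a minimal sketch, not how AV itself works: the `Alignment` record, the `stray_points` helper, the two thresholds, and the sample values are all hypothetical and chosen for illustration.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one alignment (not AV's data model)."""
    identity: float  # percent identities, 0-100
    score: float     # similarity score
    length: int      # alignment length


def stray_points(alignments, min_identity=90.0, max_score=50.0):
    """Return alignments with high percent identities but a low score.

    Both thresholds are arbitrary assumptions; suitable values depend
    on the data set being explored.
    """
    return [
        a for a in alignments
        if a.identity >= min_identity and a.score <= max_score
    ]


# Made-up values for illustration only; not taken from 10G8T7P.
sample = [
    Alignment(identity=95.0, score=30.0, length=8),
    Alignment(identity=92.0, score=400.0, length=250),
    Alignment(identity=40.0, score=35.0, length=60),
]
strays = stray_points(sample)
```

Only the first sample record passes both conditions, which mirrors the pattern in the plot: a point far along the percent-identities axis that stays low on the score axis.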
Figure 6: Using a 3D scatter plot with an additional time axis to animate 10G8T7P. The X, Y, Z, and time axes are percent identities, score, P-value, and length, respectively. The left, middle, and right snapshots are frames representing lengths 0--10, 10--20, and 20--30, respectively.
Using the visual query filters dynamically, we notice that one variable affects those stray points the most: alignment length. We then animate the visualization using length as the time axis. The animation, several frames of which appear in Figure 6, shows that the stray points correspond to very short alignments. This accounts for their low scores even though their percent identities are high.
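Using length as a time axis amounts to grouping the alignments into consecutive length ranges and drawing one frame per range. A minimal sketch of that grouping follows, assuming half-open bins of width 10 (so a length of exactly 10 falls in the 10--20 frame); the function name `frames_by_length`, the tuple layout, and the sample values are hypothetical and are not AV's implementation.

```python
from collections import defaultdict


def frames_by_length(alignments, bin_width=10):
    """Group (identity, score, length) tuples into animation frames.

    Each frame is keyed by the lower bound of its length range, e.g.
    key 0 holds lengths 0-9 and key 10 holds lengths 10-19. The
    half-open bin boundaries are an assumption of this sketch.
    """
    frames = defaultdict(list)
    for identity, score, length in alignments:
        start = (length // bin_width) * bin_width
        frames[start].append((identity, score, length))
    # Sort by bin start so frames play back in order of increasing length.
    return dict(sorted(frames.items()))


# Made-up (identity, score, length) values for illustration only.
sample_hits = [
    (95.0, 30.0, 8),
    (60.0, 120.0, 27),
    (100.0, 28.0, 9),
    (88.0, 45.0, 14),
]
frames = frames_by_length(sample_hits)
```

With these sample values the first frame holds the two shortest alignments, both with high percent identities and low scores, which is the kind of pattern the animation makes visible.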
Glyphs and points in AV are hyperlinked to the actual alignment report in HTML format. Clicking on the points reveals that the stray points correspond to alignments containing a light-harvesting complex chlorophyll binding protein. The short alignments with high percent identities correspond to ``motifs'' that are highly conserved in the binding protein. Motifs are regions that have been preserved with little change over evolution, presumably because their function is important to the survival of the organism.
This example uses the common approach of exploring general trends and outliers to identify interesting features in the data. The visual query filters aid in finding these features because the user can interactively explore the correlations between variables.