Besides similarity data, a time series of matrices is another type of data that presents challenges commonly encountered in information visualization. Two major difficulties arise in dealing with time-series matrices. The first is identifying differences in the matrix values between successive matrices. The second is that many visual representations can be applied. For example, the "cityscape" representation shows the matrix values as 3D bars, whereas the "heatmap" representation shows the values as colored tiles [28]. Different representations bring out different features, so an easy way to view and explore several of these representations simultaneously is needed. Fortunately, the spreadsheet environment is well suited to dealing with these difficulties.
We encountered two matrix series in trying to solve problems with
molecular biologists, who are interested in studying the effect of
mutation and natural selection on genetic sequences. Natural
selection accepts certain mutations, which result in the substitution
of one protein residue by another. For a mutation to be
accepted, the protein usually must function in a similar way to the
old one, presumably due to chemical and physical similarities. PAM
and BLOSUM are two series of matrices with each matrix representing
substitution probabilities at a given evolutionary
distance [7, 11]. The two matrix series were
calculated from different sets of information sources. An element
(i, j) of a matrix specifies the relative probability that amino
acid i will be substituted by amino acid j after a given evolutionary
interval. A positive entry specifies an accepted mutation that is
more likely than random, whereas a negative entry specifies one that
is less likely than random.
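The sign convention can be illustrated with a log-odds score, a common way of scaling the entries of substitution matrices. The sketch below is only an illustration: it assumes that an entry is the base-2 logarithm of the ratio between the observed pair frequency and the frequency expected by chance. The function name `log_odds` and all frequencies are hypothetical and are not taken from PAM or BLOSUM.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float) -> float:
    """Log-odds score in bits (assumed scaling, for illustration only).

    q_ij is the observed frequency of the residue pair (i, j);
    p_i and p_j are the background frequencies of the two residues.
    """
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies, chosen only to show the sign of the score.
p_i, p_j = 0.1, 0.05                       # background frequencies
favoured = log_odds(0.01, p_i, p_j)        # pair seen about twice as often as chance
disfavoured = log_odds(0.0025, p_i, p_j)   # pair seen about half as often as chance

print(favoured > 0, disfavoured < 0)
```

Under this assumed scaling, a substitution observed more often than chance receives a positive score and one observed less often receives a negative score, which matches the reading of positive and negative entries given above.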
The detailed nature of these series of matrices results in a large amount of information [7]. For example, these matrices are used in calculating the similarity between sequences. Unfortunately, the computational molecular biology community has not applied visualization techniques to these matrices. Biologists are nonetheless very interested in understanding these series of matrices, both because of their mathematical and biological complexity and because the choice of which matrix to employ depends on the situation.
We have used the SIV system (the second prototype) to try to gain a better understanding of these matrices. We used our system to compare the two matrix series (PAM and BLOSUM), and found the ability to quickly bring in data and lay them out in different ways to be extremely useful. For example, after seven lines of commands, the last row shows the BLOSUM62 matrix. To understand the differences between the matrices, it is important to be able to visually compare a number of different matrices simultaneously. In Figure 2, the first, second, third, and fourth rows of cells visualize the PAM40, PAM120, PAM250, and BLOSUM62 matrices, respectively. The first column uses a cube representation that maps positive matrix values to the volume, height, and color attributes of the cubes. The second column uses a carpet plot that maps values to the height and color of a 3D surface (using a rainbow colormap with negative entries mapped to red). The third column uses a bar representation that maps values to the length, height, and color attributes of the bars. The fourth column shows various representations in different rotational configurations.
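The row-by-column layout can be sketched as a small grid data structure. This is a minimal illustration in Python and not the SIV command language: the names `grid` and `column` are hypothetical, and only the three columns with a single fixed representation are modeled.

```python
from itertools import product

matrices = ["PAM40", "PAM120", "PAM250", "BLOSUM62"]  # one row per matrix
representations = ["cube", "carpet", "bar"]           # one column per representation

# Each cell pairs one matrix with one representation, as in a spreadsheet.
grid = {
    (r, c): (m, rep)
    for (r, m), (c, rep) in product(enumerate(matrices), enumerate(representations))
}


def column(c: int) -> list:
    """Cells met when scanning column c from top to bottom."""
    return [grid[(r, c)] for r in range(len(matrices))]


for cell in column(1):
    print(cell)
```

Scanning a column holds the representation fixed and varies the matrix, while scanning a row holds the matrix fixed and varies the representation.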
In Figure 2, by vertically scanning the spreadsheet, the user can detect differences between matrices quickly. As we can see from all the columns, the diagonals of these matrices have strong values, which makes sense since the identity substitution (no mutation) is favored by evolution. From the second column we see that the matrices are quite different, because the colors become progressively brighter from top to bottom. The last row shows the BLOSUM62 matrix, and we see that its values are clearly different from those of any of the PAM matrices shown.
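The two observations made by eye, strong diagonals and changes between successive matrices, have simple numerical counterparts. The sketch below is a hedged illustration: the helper names are hypothetical, and the 3x3 matrices contain made-up values rather than real PAM or BLOSUM entries.

```python
def elementwise_diff(a, b):
    """Entry-by-entry difference b - a between two equally sized matrices."""
    return [[y - x for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def diagonal_dominates(m):
    """True if every diagonal entry is the largest value in its row."""
    return all(row[i] == max(row) for i, row in enumerate(m))


# Toy matrices with invented values, standing in for two successive matrices.
early = [[6, -2, -3], [-2, 7, -1], [-3, -1, 8]]
later = [[3, 0, -1], [0, 4, 0], [-1, 0, 5]]

diff = elementwise_diff(early, later)
print(diagonal_dominates(early), diagonal_dominates(later))
print(diff)
```

A difference matrix of this kind could itself be placed in a spreadsheet cell and rendered with any of the representations, so that changes between successive matrices are shown directly.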
Figure 2: Visualization of time-series matrices. The visualization is
built using the second system (SIV). The screen snapshot shows
visualizations of protein residue substitution probability matrices of
various evolutionary distances. The first, second, and third rows
visualize matrices 40, 120, and 250 from the PAM matrix series. The
fourth row visualizes matrix 62 from the BLOSUM matrix series.
The first column uses a cube representation that maps positive matrix
values to the volume, height, and color attributes of the cubes. The
second column uses a carpet plot that maps values to the height and
color of a 3D surface. The third column uses a bar representation
that maps values to the length, height, and color attributes of the
bars. The fourth column shows various representations in different
rotational configurations.