Much has been done in the field of multivariate visualization, so our treatment of related work is illustrative rather than comprehensive. In this section we focus on the work most closely related to ours: techniques used in dynamic multivariate statistics systems, additional techniques for displaying high-dimensional data, and other biological sequence visualizations.
Statisticians have long investigated the display and exploration of multidimensional data. Common techniques used in systems such as PRIM-9 [18] and MacSpin [9] include two- and three-dimensional scatter plots, dynamic rotation of 3D displays, simple animation capabilities, and the ability to mask or mark subsets of the data. Many of these packages center on a concept called projection pursuit, in which the user steps through different low-dimensional projections of the data. In many cases, point sets of very high dimension are considered. Projection pursuit techniques ignore the semantics of the different variables and treat every axis equally. Because exhaustively enumerating all possible projections is prohibitive when the dimensionality is large, "grand tour" algorithms have been developed to choose sequences of different projections automatically [3,7].
In the visualization community, a number of techniques have been used for displaying high-dimensional data, including glyphs, worlds-within-worlds, and parallel coordinates. With glyphs, two or three variables of a datum are typically used to position a small marker representing the datum, while other variables are encoded by the marker's size, color, and similar attributes. In worlds-within-worlds, a point in 3D is first specified, and a second, smaller frame is then displayed at that point. A surface can then be drawn using a new coordinate system within the second frame [10]. Parallel coordinates lays out the axes in parallel, with each data point represented by a polyline that crosses every axis at the point's value for that variable [14,15]. The use of these and related techniques usually involves interaction. For example, a user-controlled probe may indicate a location at which a world-within-world frame is then displayed to provide more information, or a user may specify a range on one axis of a parallel coordinates display to filter out lines that fall outside that range. Another example, from user interface research, combines two-dimensional scatter plots with interactive query filters [1]. Dynamic interactive capabilities have been found to be essential in exploring high-dimensional data, as in the network visualization system of [4].
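The parallel coordinates layout and the range filter described above can be illustrated with a small sketch. This is not code from any of the cited systems; the function names `axis_ranges`, `to_polyline`, and `brush` are hypothetical, and the choice to normalize each axis to the interval [0, 1] is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple

Datum = Sequence[float]


def axis_ranges(data: Sequence[Datum]) -> Tuple[List[float], List[float]]:
    """Return the per-axis minima and maxima over all data points."""
    columns = list(zip(*data))
    return [min(c) for c in columns], [max(c) for c in columns]


def to_polyline(datum: Datum, mins: Sequence[float],
                maxs: Sequence[float]) -> List[Tuple[int, float]]:
    """Map one n-dimensional datum to polyline vertices.

    Each vertex is (axis index, normalized height in [0, 1]); drawing
    segments between consecutive vertices gives the datum's polyline.
    """
    vertices = []
    for axis, value in enumerate(datum):
        span = maxs[axis] - mins[axis]
        # Assumption: a constant axis is drawn at mid-height.
        height = 0.5 if span == 0 else (value - mins[axis]) / span
        vertices.append((axis, height))
    return vertices


def brush(data: Sequence[Datum], axis: int,
          low: float, high: float) -> List[Datum]:
    """Keep only the data whose value on the given axis lies in [low, high]."""
    return [d for d in data if low <= d[axis] <= high]
```

Calling `brush` before `to_polyline` corresponds to the interaction in which a user selects a range on one axis and only the lines passing through that range remain visible.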
Previous work in biological sequence visualization has concentrated on single-sequence representations, which serve as alternatives to writing a sequence in the DNA alphabet. The H-Curve, W-Curve, and chaos game representation are iterative methods for representing a long DNA sequence [12,16,19].
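To make the idea of an iterative single-sequence representation concrete, the following is a minimal sketch of the chaos game representation as it is commonly described: each nucleotide is assigned to a corner of the unit square, and each successive base moves the current point halfway toward its corner. The particular corner assignment, the starting point at the center of the square, and the decision to skip characters other than A, C, G, and T are assumptions of this sketch, and the function name `chaos_game` is hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed corner assignment; published conventions differ.
CORNERS: Dict[str, Tuple[float, float]] = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def chaos_game(sequence: str) -> List[Tuple[float, float]]:
    """Return one 2D point per recognized base in the sequence."""
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous or unknown symbols (assumption)
        # Move halfway from the current point toward the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points
```

Plotting the returned points as a 2D scatter plot yields one picture per sequence, which illustrates why comparing two sequences with such a representation amounts to visually comparing two separate plots.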
Our work is both similar to and different from these related projects. Our system has many similarities to the dynamic multivariate statistics packages. For example, our new system uses scatter plots in some situations, allows different projections of the data, continues to allow interactive geometric transformations of the scene, and incorporates some, albeit different, data filtering capabilities. It differs from most such systems in that it still uses the glyph representation in many situations and places more emphasis on animation for data display and exploration. Furthermore, while both our system and certain statistics systems allow easy user-defined mappings of variables to axes, our variables have strong semantics associated with them, and our users are likely to know which mappings are most useful in a given situation. We therefore did not make use of techniques, such as projection pursuit, that treat all variables equally.
Our work also shares some characteristics of the higher-dimensional visualization techniques mentioned above. We use scatter plots and visual query filters similar to those in [1], and found the technique extremely useful for sequence data. As in worlds-within-worlds, each of our points opens up to another world in certain mappings; that world is depicted by a glyph shaped like a comb. As in the parallel coordinates method, we use sliders to filter and construct queries. While our technique has similarities with each of these methods, there are some obvious differences. Our system departs significantly from [1] by using glyphs and incorporating an additional time axis, thus introducing animation as a further tool for finding correlations between variables. The worlds-within-worlds method has been used successfully for examining point data that is dense in multidimensional space, such as points on a hyper-surface or values in a vector field [10]. Our multivariate data, however, is sparser and requires a different approach. No obvious method exists for modifying the parallel coordinates technique to depict the alignment itself, since there can be hundreds or even thousands of matching positions in the alignment. Our use of glyphs makes the alignment itself and its associated data visible. Rather than applying these higher-dimensional visualization techniques individually, our system combines them, providing a simple but powerful set of tools for exploring data. Moreover, because the tools interact, their combined use provides more capabilities than applying each tool independently would.
Although our system visualizes biological sequence information, it differs significantly from the single-sequence representations mentioned above. Such representations can reveal interesting features of individual sequences, but they are difficult to use for comparing sequences: comparing two sequences would involve detailed visual inspection of a pair of three-dimensional curves or 2D plots. Further, while single-sequence representations are valuable for viewing small amounts of sequence data, they are not designed for large datasets.
Given the complexity of our multivariate data, our motivation is to combine various techniques into a single method that enables biologists to discover relationships that the dimensionality of the data would otherwise make difficult to find.