Problem solving and decision making are essential components of most complex tasks and are increasingly supported through computer information displays [8]. In our daily lives, newspapers and magazines often employ graphical design principles to communicate statistical information. Our experiences show that the approach to graphic presentation can hinder or promote accurate and effective processing of information [10]. The field of visualization has emerged as researchers seek creative ways to represent data through the use of interactive computer graphics.
Interest in visualization-based user interfaces has blossomed in the past few years, with systems developed for applications from aerospace engineering to geology, molecular biology, world wide web structure, and animal behavior patterns [8, 1, 12]. Since the problems of information analysis are grounded in the needs of a discipline, the design of the visualization requires domain specific knowledge [8].
Because of the wide variety of data domains, the challenge is to design an environment that enables users to perform difficult tasks in an intuitive manner [9]. The invention of the VisiCalc numerical spreadsheet in 1979 fueled the adoption of personal computers. In 1994, a Stanford University researcher suggested using a spreadsheet-like interface for image processing, and demonstrated the utility of a prototype for that data domain [7]. Extending the ideas in that paper, we believe the spreadsheet interface can be applied to other visualization domains. Further research is needed to identify situations in which the visualization spreadsheet can support tasks that were previously either difficult or impossible to accomplish.
In this project, we will try to find answers to three questions. The first question is, ``How is the visualization spreadsheet valuable for user tasks?'' We believe the value of a visualization spreadsheet lies in enabling users to build multiple visual representations of data sets, perform operations on the visual representations, and compare and contrast the results visually. Because the spreadsheet provides a structured environment in which to perform tasks, we expect it to significantly reduce the amount of time it takes to analyze data.
The second question is, ``What kinds of user tasks are supported by the visualization spreadsheet environment?'' We believe there are a large number of tasks that are particularly suitable. Here are some intuitive ones: (1) Tasks that involve applying a single operation to multiple visualizations, (2) Tasks that involve exploring similar features of different data sets, (3) Tasks that study the interactions between several different variables, and (4) Tasks that explore ``what-if'' scenarios.
The third question is, ``How general is the visualization spreadsheet paradigm?'' By selecting a wide variety of different types of data from several scientific disciplines, we will show the generality of the idea. We have examined the task requirements for genetic sequence analysis from molecular biology, usage pattern analysis from the World-Wide Web, and data mining from databases.
Figure 2 shows the research plan for the Visualization Spreadsheet project. We take a prototype-driven research approach in studying how spreadsheet environments can be employed for visualization. In the first phase, we performed domain-specific studies to gather user requirements. For each domain, we also performed an initial interface design. In the second phase, we analyzed all requirements gathered in the first phase, and we are continuing the design and implementation of the spreadsheet framework. In the final phase, we will evaluate the framework and specific applications using several evaluation methods.
We have constructed two prototype visualization spreadsheet systems. The first is a domain-specific study of how spreadsheets can be structured and used for analyzing genetic sequence similarity reports. It provides a simple point-and-click interface using pull-down menus. The system builds upon the ideas in a previous system we call ``AlignmentViewer'' [4], and is designed for biologists and their task of comparing similarity reports. The second system is a general visualization spreadsheet built on top of the Visualization Toolkit (VTK) [9]. In addition to a point-and-click interface, it provides a programming environment where users can define and enter their own commands. We will demonstrate the usage of this system, called ``Spreadsheet for Information Visualization'' (SIV, pronounced ``sieve''), which is built on top of a multi-platform interpreted development system combining Tcl/Tk and VTK. We use VTK because it provides an object-oriented architecture with many pre-built objects.
Within an information visualization, many user tasks involve the application of operations such as comparison, filtering, and animation. Because the primary elements are visual, the vocabulary for the spreadsheet is richer, which makes the user interface for these operations more difficult to design. Certain operators may take columns, rows, or a subgroup of cells as operands. In this paper, we show how the visual spreadsheet paradigm facilitates data exploration by enabling researchers to derive comparison datasets using operators such as set addition and subtraction. We also illustrate how the spreadsheet paradigm enables the parallel application of operators to a range of cells, facilitating visual comparison of values in the cells. By constructing a layout configuration, the user can set up analysis templates which can then be applied to many datasets. Another large problem involving user-system interaction in information visualization is that, for a given data type, several different visual representations are at the user's disposal. Here we discuss how to use the spreadsheet paradigm to enable the exploration of multiple visual features in the spreadsheet simultaneously. Equipped with a set of operations, users can explore datasets in their own unique situations by combining the operations in various ways.
The spreadsheet paradigm provides a simple interface for performing value operators that derive new datasets, such as subtraction and addition. Let us illustrate using an algorithm visualization example. In Figure 1, we show an algorithm visualization of 3D Delaunay triangulation, which forms tetrahedra from a set of randomly generated 3D points. Even though the problem of 3D triangulation is well studied, it is still quite non-intuitive for many people. Traditional algorithm visualization techniques use animation and sequential layouts to show successive steps so that viewers can gain better insight. Here the columns show the results of the algorithm after 5, 6, 25, and 50 steps, from left to right. Row 1 shows the point set using 3D scatter plots. Row 2 shows the transparent tetrahedra after 3D Delaunay triangulation has been performed. Row 3 represents the tetrahedra using edges between vertices.
By adding the geometric contents of cells together, the user can aggregate visualizations to create new representations. The last row (Row 4) aggregates several cells to form new visualizations that show differences between successive steps. One such cell shows the difference between steps 5 and 6, whereas another shows the difference between steps 6 and 25. We can see where new points were added to the point set, as well as the structural changes in the convex hulls between steps. In a third cell, we see that the convex hull after 25 steps is completely embedded inside the convex hull obtained after 50 steps. Since we know that adding points to the triangulation can only increase the size of the convex hull, this discovery makes sense. We see the blue surfaces and vertices where the convex hull has not changed. A further cell shows the aggregate of adding all of the stick models in Row 3 together. These representations were discovered after many iterations of trying different combinations of the point, stick, and surface representations of the data in Rows 1, 2, and 3.
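The cell addition and subtraction described above can be sketched at the geometric level as set operations. The following is a minimal Python illustration, not the SIV implementation: it assumes each stick-model cell is stored as a set of undirected edges, and the function names (`edges_of`, `add_cells`, `subtract_cells`) and vertex ids are hypothetical.

```python
from itertools import combinations


def edges_of(tetrahedra):
    """Return the set of undirected edges used by a list of tetrahedra."""
    edges = set()
    for tet in tetrahedra:
        # Each tetrahedron (4 vertex ids) contributes its 6 edges.
        for a, b in combinations(sorted(tet), 2):
            edges.add((a, b))
    return edges


def add_cells(*cells):
    """Cell addition: the union of the geometric primitives of all operands."""
    result = set()
    for cell in cells:
        result |= cell
    return result


def subtract_cells(later, earlier):
    """Cell subtraction: primitives present in `later` but absent from `earlier`."""
    return later - earlier


# Hypothetical stick models at two successive steps (vertex ids are made up).
step_5 = edges_of([(0, 1, 2, 3)])
step_6 = edges_of([(0, 1, 2, 3), (1, 2, 3, 4)])

new_edges = subtract_cells(step_6, step_5)   # structure added by the new point
unchanged = step_6 & step_5                  # structure shared by both steps
aggregate = add_cells(step_5, step_6)        # all sticks drawn together
```

In this toy case the subtraction isolates the three edges that connect the newly inserted vertex 4, which is the kind of step-to-step difference that Row 4 displays visually.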
Interestingly, these algebraic operations can take on different semantics at multiple levels. At the low level, we can capture the cell images and perform image subtractions, which is done by subtracting corresponding pixels. At the mid level, as shown in the above algorithm visualization example, we can perform geometric object algebraic operations. We can define objects and algebraically add them to or subtract them from the scene. At the high level, we can perform algebraic operations based on the particular data domain semantics.
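As an illustration of the low level, the sketch below subtracts two captured cell images pixel by pixel. It assumes grayscale images stored as nested lists and uses the absolute difference so that results stay non-negative; both choices are assumptions made for this example, as is the function name `subtract_images`.

```python
def subtract_images(image_a, image_b):
    """Pixel-wise absolute difference of two equally sized grayscale images."""
    same_height = len(image_a) == len(image_b)
    same_widths = all(len(ra) == len(rb) for ra, rb in zip(image_a, image_b))
    if not (same_height and same_widths):
        raise ValueError("images must have the same dimensions")
    return [
        [abs(pa - pb) for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Hypothetical 3x3 captures of one cell before and after a step.
cell_before = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
cell_after = [[0, 0, 0], [0, 255, 255], [0, 0, 0]]

# Only the pixel that changed between the captures remains non-zero.
difference = subtract_images(cell_after, cell_before)
```

Unlike the geometric and domain-level operators, this form needs no knowledge of what the cells depict, but it is sensitive to any change in viewpoint between the two captures.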
Figure 3: A screen snapshot of visualizing sequence similarity reports after performing three operations. (Step 1) Initially, we loaded each column with a slightly different, but related, dataset (A1=B1=C1=D1, A2=B2=C2=D2, A3=B3=C3=D3). (Step 2) We selected Row B, and then subtracted cell A3 from it (B1=B1-A3, B2=B2-A3, B3=B3-A3). Cell B3 contains the empty set as expected. (Step 3) We changed Rows C and D to show different views of Row A. The views show different sets of variables using a different representation, thus increasing our ability to see other dimensions of the multivariate datasets simultaneously.
We encountered the need to examine domain semantics for operators in a domain study with molecular biologists exploring DNA sequences. These biologists often compare a given sequence against a database of known sequences, generating thousand-page reports of possible similar regions (alignments) and other information useful to them. Based on AlignmentViewer, a previous visualization system we built for this data [4, 3], we constructed a spreadsheet for the research task of comparing similarity reports. The basic 3D visual representation consists of comb-like glyphs that show the alignments, how similar they are, and where they occur along the input sequence. For example, see cell A1 in Figure 3. This spreadsheet is built in C++ using OpenGL and Motif, and includes a computational steering environment for rapidly executing the similarity algorithm on multi-processor machines. For analysis, it provides animation, filtering, and variable-to-axis mapping capabilities.
Molecular biologists are interested in locating differences between several algorithm runs with different algorithmic parameters. Figure 3 shows a snapshot of an example session that is the result of the three-step analysis described in its caption.
The ability to generate comparison datasets is important in the process of exploring the differences between related datasets. If we know the domain semantics, we can apply this spreadsheet principle to enable users to algebraically explore differences between datasets. The addition and subtraction operations shown here are a typical case of comparing two similar, but not identical, datasets, something of interest to researchers in many fields. The spreadsheet approach makes such algebraic manipulations straightforward.
Other than algebraic operators and simple scene operations, we have found that other operations, such as animation and dynamic query filtering, are also useful under this principle. For example, by selecting a column of cells, the user can apply an animation operation to those cells simultaneously. Or the user can apply a data filtering operator to a row of data to cut out unwanted data points. As a concrete example, in Step 1 of our sequence similarity example in Figure 3, we load the datasets by first selecting a column by clicking on the column button, and then applying a load-dataset operator to all the cells in that column. In Step 2, we subtract Cell A3 from Row B by first selecting Row B and then applying a subtraction operator to all the cells in that row.
This principle is important because distributing a single operation across a group of datasets is a common interaction in data exploration. We speed up users' tasks by automating the chore of applying an operation that needs to be applied to a large number of cells.
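The sketch below mimics Steps 1 and 2 of the Figure 3 session with a toy grid, as one way to picture an operator being distributed over a selected row or column. It is an illustration under stated assumptions, not the actual system: the `Sheet` class, its methods, and the datasets (sets of made-up alignment ids) are all hypothetical.

```python
class Sheet:
    """A toy grid whose cells hold datasets (here: sets of alignment ids)."""

    def __init__(self, rows, columns):
        self.rows = list(rows)
        self.columns = list(columns)
        self.cells = {(r, c): set() for r in self.rows for c in self.columns}

    def row(self, r):
        """Select every cell in row `r`."""
        return [(r, c) for c in self.columns]

    def column(self, c):
        """Select every cell in column `c`."""
        return [(r, c) for r in self.rows]

    def apply(self, selection, operator):
        """Apply one operator to each selected cell, replacing its contents."""
        for key in selection:
            self.cells[key] = operator(self.cells[key])


sheet = Sheet("ABCD", [1, 2, 3])

# Step 1: load each column with a related but slightly different dataset.
datasets = {1: {"a", "b", "c"}, 2: {"b", "c", "d"}, 3: {"b", "c"}}
for col, data in datasets.items():
    sheet.apply(sheet.column(col), lambda _old, data=data: set(data))

# Step 2: select Row B and subtract cell A3 from every cell in it.
a3 = set(sheet.cells[("A", 3)])
sheet.apply(sheet.row("B"), lambda cell: cell - a3)
```

After Step 2, cell B3 holds the empty set, matching the caption of Figure 3, while cells B1 and B2 keep only the elements that A3 lacks.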
The algorithm visualization of Figure 1 shows several different visual representations of a 3D Delaunay triangulation. Row 1 shows the point set as 3D scatter plots, a representation that shows the spread of the points quite well. Row 2 shows the same data using transparent tetrahedra after 3D Delaunay triangulation has been performed on the point sets. Through interactive rotation, this representation gives a better view of the relative placement of the points. It also shows the convex hulls of the point sets, and how the hulls change between steps of the algorithm. Row 3 represents the Delaunay triangulation as edges rather than tetrahedra, thus giving a better view of the interior structure of the triangulation.
Our sequence similarity spreadsheet also allows the visual representation to be changed via a mapping tool. In Figure 3, the cells in Rows C and D contain the same datasets as the corresponding cells in Row A, but we changed the mapping in Rows C and D to show different variables of the similarity report. In this organization, the cells in a given column hold the same dataset; however, each row offers a different view of the data. The ability to map different variables to different axes in different cells improves our ability to see more variables simultaneously. In this spreadsheet, the operations are accomplished via a point-and-click interface. The user loads the columns with data one column at a time, and changes the mapping of the data of each row using the mapping tool dialog box. The mapping tool is implemented as a pull-down menu for each axis.
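The per-row variable-to-axis mapping can be pictured as a small lookup that projects the same records onto different axes. The following Python sketch is illustrative only; the record fields (`position`, `length`, `score`, `frame`) and the function `map_to_axes` are hypothetical and are not taken from the similarity reports themselves.

```python
def map_to_axes(records, mapping):
    """Project each record onto (x, y, z) using an axis-to-variable mapping."""
    return [
        tuple(record[mapping[axis]] for axis in ("x", "y", "z"))
        for record in records
    ]


# Hypothetical alignment records shared by every row of one column.
alignments = [
    {"position": 10, "length": 42, "score": 0.9, "frame": 1},
    {"position": 75, "length": 18, "score": 0.4, "frame": 2},
]

# One mapping per row, as if chosen from a pull-down menu for each axis.
row_mappings = {
    "A": {"x": "position", "y": "score", "z": "length"},
    "C": {"x": "position", "y": "frame", "z": "score"},
}

# Same dataset, different views: each row sees different variables.
views = {row: map_to_axes(alignments, m) for row, m in row_mappings.items()}
```

Keeping the mapping separate from the dataset is what lets several rows show the same data while exposing different variables.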
Figure 4: Visualization of time-series matrices. The screen
snapshot shows visualizations of protein residue substitution probability
matrices of various evolutionary distances. The first, second, and third
rows visualize matrix 40, 120, and 250 from the PAM matrix series. The
fourth row visualizes matrix 62 from the BLOSUM matrix series. The first
column uses a cube representation that maps positive matrix values to the
volume, height, and color attributes of the cubes. The second column uses
a carpet plot that maps values to the height and color of a 3D surface.
The third column uses a bar representation that maps values to the length,
height, and color attributes of the bars. The fourth column shows various
representations in different rotational configurations.
Let us illustrate this using the domain of time-series matrices, another type of data that presents challenges commonly encountered in information visualization. Two major difficulties arise in dealing with time-series matrices. The first difficulty is identifying differences in the matrix values between successive matrices. The second difficulty is that different representations extract different features, so an easy way to view and explore several representations simultaneously is needed. For example, the 'cityscape' representation shows the matrix values as 3D blocks, whereas the 'heatmap' representation shows the values as colored tiles [10]. Fortunately, the spreadsheet environment is well suited to dealing with these difficulties.
We encountered two matrix series while working with molecular biologists who are interested in studying the effect of evolution on genetic sequences, in which evolution accepts certain substitutions of one amino acid by another. PAM and BLOSUM are two matrix series in which each matrix represents substitution probabilities at a given evolutionary distance [6]. The element in row i and column j of a matrix specifies the relative probability that the amino acids i and j will be substituted after a given evolutionary interval. A positive entry specifies an accepted mutation that is more likely than random, whereas a negative entry specifies one that is less likely than random. The detailed nature of these matrix series results in a large amount of information [6]. For example, these matrices are used in the calculation of similarity between sequences. Biologists are very interested in understanding the nature of these series of matrices due to their mathematical and biological complexity.
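One common way to obtain entries with this sign behavior is a log-odds score, which compares how often a pair is observed with how often it would occur by chance. The text above does not state the formula used, so the Python sketch below is an assumption for illustration; the function `log_odds` and the frequencies are hypothetical, and real PAM and BLOSUM matrices additionally scale and round their entries.

```python
import math


def log_odds(observed_pair_freq, freq_i, freq_j):
    """Base-2 log of the observed pair frequency over the chance expectation.

    The chance expectation is modeled as freq_i * freq_j, i.e. the two amino
    acids are assumed to pair independently (an assumption of this sketch).
    """
    expected = freq_i * freq_j
    return math.log2(observed_pair_freq / expected)


# Hypothetical frequencies: both amino acids have background frequency 0.25.
favoured = log_odds(0.125, 0.25, 0.25)       # seen more often than chance
disfavoured = log_odds(0.03125, 0.25, 0.25)  # seen less often than chance
neutral = log_odds(0.0625, 0.25, 0.25)       # seen exactly as often as chance
```

A pair observed more often than chance gets a positive score and one observed less often gets a negative score, which matches the interpretation of positive and negative entries given above.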
We used SIV to try to gain a better understanding of PAM and BLOSUM, two matrix series that were calculated from different sets of information sources. To understand the differences between the matrices, it is important to be able to visually compare a number of different matrices simultaneously. In Figure 4, the first, second, third, and fourth rows of cells visualize the PAM40, PAM120, PAM250, and BLOSUM62 matrices, respectively. We found the ability to quickly bring in data and lay it out in different ways to be extremely useful. For example, after 7 lines of commands, the last row shows the BLOSUM62 matrix.
By constructing several modules for different visual representations of matrices, we used our spreadsheet to answer specific scientific questions about these amino acid substitution time-series matrices. In Figure 4, the tabular layout is used to show different visual representations in different columns. Across each row, the values in the cells are the same, but we vary the visual representation to bring out different features of the dataset. We used it to discover several novel patterns in these matrices. The first column uses a cube representation that maps positive matrix values to the volume, height, and color attributes of the cubes. This representation shows the interesting variation of the diagonal entries more clearly than the other representation methods. The entry represented by the orange cube varies more than any other entry. The second column uses a carpet plot that maps values to the height and color of a 3D surface (using a rainbow colormap with negative entries mapped to red). The carpet plot technique shows that the matrices have different ranges of values (i.e., the colors get brighter and brighter from top to bottom). The third column uses a bar-plot representation that maps values to the length, height, and color attributes of the bars. The bar-plot technique makes it easy to compare a specific entry from matrix to matrix, and shows the overall decreasing trend of most off-diagonal entries. The fourth column shows various representations in different rotational orientations.
By vertically scanning the spreadsheet, the user can detect differences between matrices quickly. As we can see from all the columns, the diagonals of these matrices have strong values, which makes sense since the identity substitution (no mutation) is favored by evolution. From the second column we see that the matrices are quite different because the colors get brighter and brighter from top to bottom. The last row shows the BLOSUM62 matrix, and we see its values are clearly different from any of the PAM matrices shown.
We found the ability to propagate the view changes in parallel to multiple cells to be highly valuable in this data analysis situation. By selecting a row, we can compare the various visual representations in the same orientation. Or alternatively, we can select a column and compare different matrices using the same visual representation.
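Propagating a view change to a selected row or column can be sketched as follows. This is a minimal Python illustration, assuming each cell keeps its own camera orientation as an azimuth and elevation in degrees; the `View` class and the helper functions are hypothetical and do not reflect how SIV or VTK store cameras.

```python
from dataclasses import dataclass


@dataclass
class View:
    """Camera orientation of one cell, in degrees."""
    azimuth: float = 0.0
    elevation: float = 0.0


def select_row(grid, r):
    """All cell keys in row `r`."""
    return [key for key in grid if key[0] == r]


def select_column(grid, c):
    """All cell keys in column `c`."""
    return [key for key in grid if key[1] == c]


def rotate_selection(grid, selection, d_azimuth, d_elevation):
    """Apply the same rotation to every selected cell."""
    for key in selection:
        view = grid[key]
        view.azimuth = (view.azimuth + d_azimuth) % 360.0
        # Clamp elevation so the camera never flips over the poles.
        view.elevation = max(-90.0, min(90.0, view.elevation + d_elevation))


# A 4x4 grid: rows are matrices, columns are visual representations.
grid = {(r, c): View() for r in range(4) for c in range(4)}

# Rotate one matrix in all representations, then one representation for all matrices.
rotate_selection(grid, select_row(grid, 0), 30.0, 10.0)
rotate_selection(grid, select_column(grid, 1), 45.0, 0.0)
```

Because every cell in the selection receives the identical rotation, the cells stay in a common orientation and remain directly comparable.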
Our experience shows the elegant organization of the spreadsheet allows interesting ways of combining different visual representations of the underlying data. Users can compare and visually extract different features from the different representations. The spreadsheet environment equips users with the necessary tools to explore the representation space.
The Visualization Spreadsheet is a powerful environment that enables users to explore available information more effectively. Computer users will care about such tools because they help interpret information and enable exploration tasks that were previously difficult or impossible. It is conceivable that one day a visualization spreadsheet will be available on every desktop computer, just as most computers have numerical spreadsheets today.