Scientific disciplines are confronted with an increasing amount of data but few tools or techniques for extracting meaningful information from it. The questions of what to visualize and how to visualize it make dealing with large, multi-dimensional datasets one of the most important and exciting areas of scientific visualization today. In this paper, we present a novel representation for this kind of data in the context of one area of molecular biology.
Molecular biologists study the function and evolutionary relationships of genes and proteins in cells. The genetic data is represented as sequential strings of letters, or sequences. Biologists use the large amount of known sequence data that now exists to help determine the function of new sequences. The similarity search algorithms developed for this purpose currently produce textual output.
Similarity reports for some sequences run to hundreds or thousands of pages of text. The amount of similarity information in these reports is proportional to the size of the databases of known sequences, which are growing rapidly. Ironically, as more information about possible function becomes available, the task of exploring similarity results and determining function becomes increasingly difficult. Biologists need visualization methods to explore effectively the enormous amount of information available in the databases.
The similarity analysis information produced is multi-dimensional, because it contains several orthogonal pieces of information for each similar region found. This discrete, multi-dimensional data lacks a natural visual representation. The work presented here is the result of our efforts to determine a useful representation for the large dataset of similarity results produced for new sequences.
In this paper, we present Alignment Viewer (AV), a novel visualization tool for the large, discrete, and multi-dimensional dataset resulting from sequence similarity analysis. The contributions of this work are:
In the next section, we present some background in computational molecular biology and previous work in visualizing DNA sequences. In section 3 we discuss AV's method for representing sequence analysis data. Section 4 contains case studies of how AV is used in practice, illustrating the features of AV and demonstrating how it is useful to biologists for examining similarity data. Finally, in section 5 we present future work and concluding remarks. We have also included a glossary of important terms from molecular biology at the end of the paper.