2 Background and Motivation




The field of molecular biology is rooted in the desire to determine the genes within the cells of organisms, the function of the proteins that these genes encode, and how these proteins are related evolutionarily across organisms. Genes, composed of DNA, are represented as discrete sequences of nucleotides, also called bases. Proteins are represented as discrete sequences of amino acids, also called residues. Across organisms, genes and their proteins are related through evolution and often share common functions. As molecular biologists discover new genes and the functions of the corresponding proteins, the body of known information grows. This information is being cataloged in DNA sequence databases for genes and in amino acid sequence databases for proteins.

The advent of automated DNA sequencing has triggered enormous growth in the public databases. Large-scale sequencing projects are being conducted on many organisms [21][14][17][1][20] and are producing vast amounts of new DNA sequence data. GenBank, the primary source of DNA sequence data, contains roughly 250,000,000 nucleotides in 270,000 sequences, and is doubling every 1.3 years [7]. The largest protein sequence database is the Protein Information Resource (PIR) [10]. PIR contains roughly 12,000,000 residues in 42,000 sequences, and is doubling every 2.4 years.
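
To make the doubling times concrete, the following minimal sketch projects database size under the assumption that growth stays purely exponential at the quoted rates. The function name `projected_size` and the idea of extrapolating forward are illustrative assumptions, not part of the cited database reports.

```python
def projected_size(current_size, doubling_years, years_ahead):
    """Extrapolate a size that doubles every `doubling_years` years.

    Assumes idealized exponential growth: size(t) = size(0) * 2**(t / T).
    """
    return current_size * 2 ** (years_ahead / doubling_years)


# Figures quoted in the text: GenBank nucleotides and PIR residues.
genbank_now, genbank_doubling = 250_000_000, 1.3
pir_now, pir_doubling = 12_000_000, 2.4

for years in (1.3, 2.6, 5.0):
    print(
        years,
        round(projected_size(genbank_now, genbank_doubling, years)),
        round(projected_size(pir_now, pir_doubling, years)),
    )
```

Under this assumption, two doubling periods (2.6 years) would quadruple GenBank, while PIR would need 4.8 years to do the same.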

Traditionally, painstakingly detailed lab experiments are designed and carried out to determine the function of proteins, and this remains a relatively slow process. One method used to speed up protein function determination is to search the databases of known sequences for similarity to an unknown sequence [5][11][18][2]. Similarity algorithms are a well-developed aspect of computational molecular biology research [6][3][26][12][25][23][19], and employ dynamic programming and heuristic search techniques. BLAST [4] and FASTA [22] are the most popular database search algorithms in use today. These algorithms identify similar regions between an input (query) sequence and all sequences in the databases of known DNA and protein sequences. These similar regions are called alignments. The results of these similarity searches allow biologists to formulate hypotheses about the possible functions of the query sequence, which are then tested in the wet lab.
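
As an illustration of what a dynamic programming similarity search computes, here is a minimal sketch of local alignment in the Smith-Waterman style. It is not the BLAST or FASTA algorithm (both rely on heuristics for speed), and the function name `local_alignment` and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(local_alignment("TTTACGTAAA", "CCCACGTGGG"))
```

The table fill costs time proportional to the product of the two sequence lengths, which is why database-scale searches fall back on heuristic methods.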

The difficulty in using similarity searches as a starting point is that the time required to interpret the large textual reports they produce is increasing. Fuchs remarked on the difficulty of this interpretation by noting that as more sequence data are gathered, "data interpretation is likely to become the time-limiting factor in genome analysis" [9].

Previous work in biological sequence visualization concentrated on single sequence representations, which are alternatives to the DNA alphabet. The H-Curve, a 3D curve defined iteratively, was suggested by Hamori and Ruskin to represent the composition of a long DNA sequence [13]. H. Jeffrey developed another iterative method, the chaos game representation [15]. Wu extrapolated the work of Hamori and Jeffrey and presented a third iterative method called W-Curves [28].
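
To show what "defined iteratively" means for these representations, the following minimal sketch computes chaos game representation points for a DNA sequence: starting from the center of the unit square, each new point lies halfway between the previous point and the corner assigned to the current base. The corner assignment and the names `CORNERS` and `chaos_game` are assumptions made for this example; other layouts are possible.

```python
# Assumed corner assignment on the unit square, one corner per base.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game(sequence, start=(0.5, 0.5)):
    """Return the list of 2D points visited for `sequence`."""
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]  # raises KeyError on symbols other than A, C, G, T
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


print(chaos_game("ACGT"))
```

Plotting the returned points for a long sequence gives a 2D picture of its composition; each point summarizes the whole prefix read so far, so one sequence yields one plot.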

While single sequence representations can reveal interesting features of individual sequences, such as repetition of nucleotides, they are difficult to use for comparing sequences. Comparing two sequences with these representations would require detailed visual inspection of a pair of three-dimensional curves or 2D plots, so they do not seem appropriate for finding similarities between sequences. Moreover, while single sequence representations are very good for viewing small amounts of sequence data, they are simply not designed for looking at large datasets.

Instead of devising a new method for representing DNA sequences, we focus on representing the large amount of data output by sequence similarity algorithms in a more comprehensible manner. This problem serves as our motivation for developing new visualization techniques that enable biologists to discover sequence relationships that are very difficult, if not impossible, to determine from textual reports. We approached the problem first by determining what information in the reports is most useful to biologists, and then by employing visualization techniques that allow them to interpret that information.






Ed H. Chi (echi@cs.umn.edu)
Fri Apr 28 12:51:35 CDT 1995
