Our multivariate data come from alignment reports generated by biological
sequence similarity algorithms. Each report consists of an input sequence
and many alignments. Each alignment is a match between a subsequence of
the input sequence and a subsequence of a sequence from the database. Thus,
an alignment indicates a region of similarity between two sequences. Such
alignments are the basic elements in our visualization system. Associated
with each alignment is the following information:
- Position. The position in the input sequence where the alignment begins.
- Frame, or frame number. The frame number defines how
the DNA sequence is translated into a protein sequence. DNA sequences are
written in a four-letter alphabet, each letter denoting a nucleotide (also
called a base). Three DNA bases encode one protein residue (also called an
amino acid). A DNA sequence can encode a protein sequence starting from the
first, second, or third position. The starting point determines how bases
are grouped into residues. When comparing a DNA sequence to a protein
sequence, each encoding is tried.
- Length. The length of the alignment measured in residues.
- Entry Date. The date on which the database sequence in the alignment
was submitted to the database.
- Similarity Scores and Residue Pair Scores. Similarity
algorithms such as BLAST compute the similarity score of each alignment.
For each pair of residues in an alignment, BLAST looks up the entry in a
substitution matrix and gets the residue pair score, a measure of the match
strength. A positive entry corresponds to a good match, and a negative entry
corresponds to a bad match. BLAST then sums all residue pair scores in the
alignment to obtain the similarity score.
- P-value. The Poisson P-value for an alignment is the probability that
the alignment could have occurred by chance. Because of its large range,
the P-value is commonly represented by its negative base-10 logarithm.
Thus, an alignment with a P-value of 10^-45 is represented by 45.
- Percent Identities. The number of exact matches as a percentage of
the total alignment length.
- Percent Positives. The number of positive matches as a percentage
of the total alignment length.
- Bits. The amount of information in the alignment, measured in
"bits" using information theory.
- PAM Evolutionary Distance. Different substitution matrices allow
different degrees of mismatches and mutations. These matrices are either
experimentally or theoretically derived. The PAM (Point Accepted Mutations)
matrices use a rough measure of how many generations of evolution it would
take to mutate one sequence into another.
For example, the PAM120 matrix allows fewer mutations than PAM250. We can
obtain a rough estimate of the evolutionary distance of the alignments
by statistically recomputing and normalizing the similarity scores obtained
by using different PAM matrices.
- BLOSUM Evolutionary Distance. The evolutionary distance measured
using the BLOSUM matrices, which are experimentally derived from sequence
data. In contrast to the PAM matrices, a low number signifies a large
evolutionary distance.
- Matrix Used. The number designating the substitution matrix
used for the alignment report. For a single input sequence, AV can read
in multiple alignment reports that were computed using different
substitution matrices.
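
The Frame item above says that the starting position determines how bases are grouped into residues. A minimal sketch of that grouping follows. The function name `codons_in_frame` is hypothetical and is not part of AV or BLAST, and the sketch covers only the three starting positions described above.

```python
def codons_in_frame(dna, frame):
    """Group DNA bases into three-base codons for frame 1, 2, or 3.

    Hypothetical helper for illustration; not taken from AV or BLAST.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    start = frame - 1  # frame 1 starts at the first base
    # Step through the sequence three bases at a time.
    # An incomplete group at the end of the sequence is dropped.
    return [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]


if __name__ == "__main__":
    example = "ATGGCCTAAG"  # made-up sequence
    for frame in (1, 2, 3):
        print(frame, codons_in_frame(example, frame))
```

Each group of three bases would then be translated into one residue, so the same DNA sequence yields a different residue sequence for each frame.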
The PAM and BLOSUM Evolutionary Distances are not given by the alignment
report; instead, AV computes them as needed. The entry date is also not
given in the report, but is retrieved from a separate database as the report
is read into the system.
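
As a rough illustration of the score-related variables listed above, the sketch below computes residue pair scores, the similarity score, percent identities, percent positives, and the negative-log P-value. Everything in it is an assumption made for illustration: `TOY_MATRIX` holds invented values rather than real PAM or BLOSUM entries, the alignment is assumed to be ungapped, and the function names are hypothetical rather than BLAST's or AV's.

```python
import math

# Invented scores for illustration only; NOT real PAM or BLOSUM values.
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("K", "K"): 5, ("R", "R"): 5,
    ("K", "R"): 2, ("A", "L"): -1, ("A", "K"): -1, ("A", "R"): -1,
    ("L", "K"): -2, ("L", "R"): -2,
}


def pair_score(a, b):
    """Look up one residue pair score; the matrix is treated as symmetric."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX[(b, a)]


def residue_pair_scores(query, subject):
    """Residue pair scores for an ungapped alignment of equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned subsequences must have equal length")
    return [pair_score(a, b) for a, b in zip(query, subject)]


def summarize(query, subject):
    """Similarity score and match percentages for one alignment."""
    scores = residue_pair_scores(query, subject)
    n = len(scores)
    identities = sum(a == b for a, b in zip(query, subject))
    positives = sum(s > 0 for s in scores)  # pairs with a positive score
    return {
        "similarity_score": sum(scores),  # sum of all residue pair scores
        "percent_identities": 100.0 * identities / n,
        "percent_positives": 100.0 * positives / n,
    }


def neg_log_p(p_value):
    """Represent a P-value by its negative base-10 logarithm."""
    return -math.log10(p_value)


if __name__ == "__main__":
    print(summarize("AKLA", "ARLK"))
    print(neg_log_p(1e-45))  # approximately 45
```
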
In addition to the above variables, each alignment has a matching vector.
This vector, represented by an array of integers, contains the residue pair
scores of the matches, starting from the first matching position. Therefore,
an alignment can be viewed as a twelve-dimensional point (for the above
twelve variables), plus a matching vector.
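
One way to picture this data model is a record with twelve scalar fields plus the matching vector. The sketch below is an assumption about a possible layout, not AV's actual data structure: the class name, field names, types, and the choice to encode the entry date as a day ordinal are all hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class Alignment:
    """Hypothetical record: twelve variables plus a matching vector."""
    position: int
    frame: int
    length: int
    entry_date: date
    similarity_score: int
    neg_log_p_value: float
    percent_identities: float
    percent_positives: float
    bits: float
    pam_distance: float
    blosum_distance: float
    matrix_used: int
    # Residue pair scores, starting from the first matching position.
    matching_vector: List[int] = field(default_factory=list)

    def as_point(self) -> Tuple[float, ...]:
        """Return the twelve variables as one numeric point."""
        return (
            self.position, self.frame, self.length,
            self.entry_date.toordinal(),  # date encoded as a day count
            self.similarity_score, self.neg_log_p_value,
            self.percent_identities, self.percent_positives,
            self.bits, self.pam_distance, self.blosum_distance,
            self.matrix_used,
        )


if __name__ == "__main__":
    # Made-up values, for illustration only.
    a = Alignment(
        position=10, frame=2, length=4, entry_date=date(1995, 3, 1),
        similarity_score=9, neg_log_p_value=45.0,
        percent_identities=50.0, percent_positives=75.0, bits=12.5,
        pam_distance=120.0, blosum_distance=62.0, matrix_used=120,
        matching_vector=[4, 2, 4, -1],
    )
    print(len(a.as_point()), sum(a.matching_vector))
```

Keeping the matching vector separate from the twelve scalars mirrors the description above: the scalars form the point, while the vector retains the per-position detail.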
Ed H. Chi
Thu Jul 11 10:52:57 CDT 1996