Mining Spatial Data: The University of Minnesota Leads the Way

Research done at the University of Minnesota by Professor of Computer
Science Shashi Shekhar and his group has made major contributions to the
field of spatial data mining, a field whose importance is growing with the
increasing prevalence of large geospatial data sets such as maps,
repositories of remote-sensing images, and the decennial census.
Applications of spatial data mining can be found in location-based services
in the mobile commerce (M-commerce) industry and in government agencies:
the US Army, to infer enemy tactics (e.g., a flank attack); NASA, for
climatology (e.g., the effects of El Niño) as well as land-use
classification of satellite imagery; transportation agencies, for detecting
local instability in traffic; NIH, for epidemiology (predicting the spread
of disease); NIJ, for finding crime hot spots; and NIMA, for creating
high-resolution three-dimensional maps from satellite imagery.

Differences between classical and spatial data mining are similar to the
differences between classical and spatial statistics. First, spatial data
is embedded in a continuous space, whereas classical datasets are often
discrete. Second, spatial patterns are often local, whereas classical data
mining techniques often focus on global patterns. Finally, one of the
common assumptions in classical statistical analysis is that data samples
are independently generated. When it comes to the analysis of spatial data,
however, the assumption of independent samples is generally false because
spatial data tends to be highly self-correlated. For example, people with
similar characteristics, occupations, and backgrounds tend to cluster
together in the same neighborhoods. In spatial statistics this tendency is
called spatial autocorrelation. Ignoring spatial autocorrelation when
analyzing data with spatial characteristics may produce hypotheses or
models that are inaccurate or inconsistent with the data set. As a result,
classical data mining algorithms often perform poorly when applied to
spatial data sets, and new methods are needed to detect spatial patterns.
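
To make spatial autocorrelation concrete, the sketch below computes
Moran's I, a standard statistic from spatial statistics that is close to
+1 when similar values sit next to each other and close to -1 when
neighbors alternate. The article does not name this statistic; it is used
here only as an illustration. The helper chain_weights and the two sample
lists are hypothetical: they assume cells arranged along a line, with
adjacent cells treated as neighbors.

```python
def morans_i(values, weights):
    """Moran's I for values[i] with spatial weights[i][j] (0 on the diagonal)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of products of deviations over all neighbor pairs.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    squared_dev = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared_dev)


def chain_weights(n):
    """Hypothetical neighborhood: cells on a line, adjacent cells are neighbors."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


# Made-up data: one attribute observed at 8 locations along a line.
clustered = [1, 1, 1, 1, 0, 0, 0, 0]    # similar values are adjacent
alternating = [1, 0, 1, 0, 1, 0, 1, 0]  # every neighbor differs
w = chain_weights(8)

print(morans_i(clustered, w))    # positive: strong spatial autocorrelation
print(morans_i(alternating, w))  # negative: neighbors are dissimilar
```

Both lists contain the same values in different positions, so any method
that ignores location treats them as identical; only a statistic that uses
the neighborhood structure tells them apart.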

The roots of spatial data mining lie in spatial statistics, spatial
analysis, geographic information systems, machine learning, image analysis,
and data mining. Several departments, such as Electrical Engineering,
Biostatistics, Geography, Forest Resources, Epidemiology, and Psychology,
and research centers, such as AHPCRC, IMA, CTS, CURA, Precision
Agriculture, and the Cancer Center, at the University of Minnesota are
contributing to the field. In fact, IMA is organizing a workshop on
Spatio-temporal Patterns in the Geosciences in the last week of September
2001 and a series of workshops related to the Mathematics of Geoscience in
2001-2002. Computer Science faculty members interested in this topic
include Prof. Dan Boley, Prof. Ravi Janardan, Prof. George Karypis, Prof.
Vipin Kumar, Prof. Nikos Papanikolopoulos, and Prof. Paul Schrater.
Prominent alumni in this field include Jack Dangermond (President,
Environmental Systems Research Institute), Dr. Raju Namburu (Army Research
Lab.), and Dr. Siva Ravada (Manager, Spatial Data Group, Oracle
Corporation). The main contributions made by Computer Science researchers
to this area include algorithms and data structures that can scale up to
massive (terabytes to petabytes) datasets, as well as the formalization of
newer spatio-temporal patterns (e.g., co-locations) that were not explored
by other research communities because of their computational complexity.
Spatial data mining projects in the Computer Science department include
discovering spatial co-locations, detecting spatial outliers, and location
prediction.

Co-location pattern discovery finds frequently co-located subsets of
spatial event types given a map (see Figure 1) of their locations. For
example, analysis of the habitats of animals and plants may identify
co-locations of predator-prey species, of symbiotic species, and of fire
events with fuel and ignition sources. Readers may find it interesting to
analyze the map in Figure 1 to find co-location patterns. There are two
co-location patterns of size 2 in this map. Our group has provided one of
the most natural formulations of the problem as well as the first
algorithms for discovering co-location patterns in large spatial datasets,
and has applied them to climatology data from NASA.
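
As a rough illustration of what scoring a size-2 co-location pattern can
look like, the sketch below computes, for a pair of event types, the
smaller of the two fractions of instances that have an instance of the
other type within a given radius. This is a simplified, brute-force
stand-in and not necessarily the formulation or algorithm developed by the
group; the function name, the radius, and the event coordinates are all
hypothetical.

```python
from math import hypot


def participation_index(points, type_a, type_b, radius):
    """Score the pair (type_a, type_b): the smaller of the two fractions of
    instances that have at least one instance of the other type within radius."""
    a = [(x, y) for t, x, y in points if t == type_a]
    b = [(x, y) for t, x, y in points if t == type_b]
    if not a or not b:
        return 0.0

    def fraction_near(src, dst):
        # Fraction of src locations with some dst location within radius.
        near = sum(
            1
            for (x1, y1) in src
            if any(hypot(x1 - x2, y1 - y2) <= radius for (x2, y2) in dst)
        )
        return near / len(src)

    return min(fraction_near(a, b), fraction_near(b, a))


# Hypothetical event map: (event type, x, y).
events = [
    ("fire", 1.0, 1.0), ("fuel", 1.2, 1.1),
    ("fire", 5.0, 5.0), ("fuel", 5.1, 4.8),
    ("fire", 9.0, 2.0), ("fuel", 8.9, 2.2),
    ("lake", 3.0, 8.0), ("lake", 7.0, 7.5),
]

print(participation_index(events, "fire", "fuel", radius=0.5))  # always together
print(participation_index(events, "fire", "lake", radius=0.5))  # never together
```

The nested scan compares every pair of locations, which is workable for a
toy map but is exactly the kind of cost that calls for scalable algorithms
on large datasets.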

Spatial outliers are observations that differ significantly from their
neighborhood even though they may not differ significantly from the entire
population. For example, a brand new house in an old neighborhood of a
growing metropolitan area is a spatial outlier. Figure 2 shows another use
of spatial outliers, in traffic measurements from sensors on I-35W
(northbound) over a 24-hour period. Sensor 9 seems to be a spatial outlier
and may be a bad sensor. Note that the figure also shows three clusters of
sensor behavior: morning rush hour, evening rush hour, and busy daytime.
Spatial statistics tests for detecting spatial outliers do not scale up to
massive datasets, such as the Twin Cities traffic dataset, which is
measured at thousands of locations in 30-second intervals and archived for
years. We generalized spatial statistics tests to spatio-temporal datasets
and developed scalable algorithms for detecting spatial outliers in massive
traffic datasets.
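
The idea of comparing a location with its neighborhood rather than with
the whole population can be sketched as follows. For each sensor the code
takes the difference between its reading and the average of its adjacent
sensors, standardizes those differences, and flags the sensors whose
standardized difference is large. This is one simple neighborhood test,
offered as an assumption rather than as the group's algorithm; the
readings, the threshold of 2.0, and the layout of sensors along a single
road are made up for illustration.

```python
from statistics import mean, pstdev


def spatial_outliers(readings, threshold=2.0):
    """Return indices of sensors whose reading differs sharply from the
    average of their adjacent sensors (sensors assumed to lie along a road)."""
    diffs = []
    for i, value in enumerate(readings):
        neighbors = [readings[j] for j in (i - 1, i + 1) if 0 <= j < len(readings)]
        diffs.append(value - mean(neighbors))
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return []  # every sensor agrees with its neighbors
    return [i for i, d in enumerate(diffs) if abs((d - mu) / sigma) > threshold]


# Hypothetical average speeds for 12 consecutive sensors (index 0 to 11).
speeds = [60, 62, 61, 59, 60, 58, 61, 60, 62, 5, 61, 60]

print(spatial_outliers(speeds))  # flags the sensor that disagrees with its neighbors
```

The sensors next to a faulty one also show inflated differences, because
the bad reading pulls down their neighborhood average; the threshold keeps
them from being flagged in this example.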

Location prediction is concerned with discovering a model to infer the
locations of a spatial phenomenon from maps of other spatial features. For
example, ecologists build models to predict habitats for endangered species
using maps of vegetation, water bodies, climate, and other related species.
Figure 3 shows maps of nest locations, vegetation, and water used to build
a location prediction model for red-winged blackbirds in the Darr and
Stubble wetlands on the shores of Lake Erie in Ohio. Classical data mining
techniques yield weak prediction models because they do not capture the
autocorrelation in spatial datasets. We provided a formal comparison of
diverse techniques from spatial statistics (e.g., spatial autoregression)
as well as image classification (e.g., Markov random field based Bayesian
classifiers) and developed scalable algorithms for them.
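
A spatial autoregression model is commonly written as
y = rho * W * y + X * beta + error, where W is a neighborhood matrix and
the term W * y lets the value at each location depend on the values at
nearby locations. The sketch below computes only that lag term on a small
grid, assuming each cell's neighbors are the cells directly above, below,
left, and right of it, with equal weights. The grid of nest indicators is
hypothetical, and estimating rho and beta is outside the scope of this
sketch.

```python
def spatial_lag(grid):
    """Return W*y for a 2-D grid: each cell gets the mean of its
    up/down/left/right neighbors (a row-normalized neighborhood matrix W)."""
    rows, cols = len(grid), len(grid[0])
    lag = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                grid[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            # A 1x1 grid has no neighbors; leave its lag at 0.0.
            lag[r][c] = sum(neighbors) / len(neighbors) if neighbors else 0.0
    return lag


# Hypothetical map: 1 where a nest was observed, 0 elsewhere.
nests = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

for row in spatial_lag(nests):
    print([round(v, 2) for v in row])
```

Cells inside the cluster of nests get a high lag value and cells far from
it get a low one, which is the neighborhood information a classical model
that treats each cell independently never sees.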

Courses on Scientific Databases (Csci 8705, Fall 2001), Data Mining (Csci
seminar), and Spatial Biostatistics (PubH 8436) are wonderful opportunities
to learn more about these topics. Websites archiving recent research
publications on the topic include www.cs.umn.edu/shashi-group and
db.cs.sfu.ca/GeoMiner/