Mining Spatial Data: The University of Minnesota Leads the Way

Research at the University of Minnesota by Professor of Computer Science Shashi Shekhar and his group has made major contributions to spatial data mining, a field whose importance is growing with the increasing prevalence of large geospatial data sets such as maps, repositories of remote-sensing images, and the decennial census. Applications of spatial data mining can be found in location-based services in the M-commerce (mobile commerce) industry and in government agencies, among them the US Army, to infer enemy tactics such as a flank attack; NASA, for climatology (e.g. the effect of El Niño) as well as land-use classification of satellite imagery; transportation agencies, for detecting local instability in traffic; NIH, for epidemiology (predicting the spread of disease); NIJ, for finding crime hot spots; and NIMA, for creating high-resolution three-dimensional maps from satellite imagery.

The differences between classical and spatial data mining resemble those between classical and spatial statistics. First, spatial data is embedded in a continuous space, whereas classical data sets are often discrete. Second, spatial patterns are often local, whereas classical data mining techniques often focus on global patterns. Finally, a common assumption in classical statistical analysis is that data samples are independently generated. For spatial data this assumption is generally false, because spatial data tends to be highly self-correlated. For example, people with similar characteristics, occupations, and backgrounds tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial autocorrelation. Ignoring spatial autocorrelation when analyzing data with spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set.
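Spatial autocorrelation can be measured as well as described. One standard statistic from spatial statistics is Moran's I, which is positive when neighboring locations carry similar values and negative when they carry dissimilar ones. The sketch below is only an illustration under simple assumptions: the function names, the chain-shaped neighborhood, and the toy values are hypothetical and do not come from the group's work.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    values[i] is the measurement at location i; neighbors[i] lists the
    indices of the locations treated as adjacent to i (an assumption the
    caller supplies, e.g. cells that share a border).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Total weight W: one unit per ordered neighbor pair (i, j).
    total_weight = sum(len(nb) for nb in neighbors)
    # Cross-products of deviations over all neighboring pairs.
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    spread = sum(d * d for d in dev)
    if total_weight == 0 or spread == 0:
        raise ValueError("need at least one neighbor pair and non-constant values")
    return (n / total_weight) * (cross / spread)


def chain_neighbors(n):
    """Hypothetical neighborhood: n locations in a row, each adjacent to the next."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


nb = chain_neighbors(6)
clustered = [1, 1, 1, 5, 5, 5]    # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # neighbors always differ

print(morans_i(clustered, nb))
print(morans_i(alternating, nb))
```

On this toy data the clustered arrangement scores about 0.6 and the alternating one about -1.0, even though both contain exactly the same values. A method that assumes independent samples cannot tell the two arrangements apart.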
Because of spatial autocorrelation, classical data mining algorithms often perform poorly when applied to spatial data sets, and new methods are needed to analyze spatial data and detect spatial patterns.

The roots of spatial data mining lie in spatial statistics, spatial analysis, geographic information systems, machine learning, image analysis, and data mining. At the University of Minnesota, several departments (such as Electrical Engineering, Biostatistics, Geography, Forest Resources, Epidemiology, and Psychology) and research centers (such as AHPCRC, IMA, CTS, CURA, Precision Agriculture, and the Cancer Center) are contributing to the field. In fact, IMA is organizing a workshop on Spatio-temporal Patterns in the Geosciences in the last week of September 2001 and a series of workshops related to the Mathematics of Geoscience in 2001-2002. Computer Science faculty members interested in this topic include Prof. Dan Boley, Prof. Ravi Janardan, Prof. George Karypis, Prof. Vipin Kumar, Prof. Nikos Papanikolopoulos, and Prof. Paul Schrater. Prominent alumni in this field include Jack Dangermond (President, Environmental Systems Research Institute), Dr. Raju Namburu (Army Research Lab.), and Dr. Siva Ravada (Manager, Spatial Data Group, Oracle Corporation).

The main contributions of Computer Science researchers to this area include algorithms and data structures that scale up to massive (terabyte to petabyte) data sets, as well as the formalization of newer spatio-temporal patterns (e.g. co-locations) that other research communities had not explored because of their computational complexity. Spatial data mining projects in the Computer Science department include discovering spatial co-locations, detecting spatial outliers, and location prediction.

Co-location pattern discovery finds frequently co-located subsets of spatial event types, given a map of their locations (see Figure 1).
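To make the definition concrete, here is a minimal brute-force sketch for co-location patterns of size 2. It assumes a fixed neighborhood radius and scores each pair of event types by the smaller of two fractions: the share of instances of each type that have an instance of the other type nearby. This is a prevalence measure in the spirit of the participation index from the co-location literature. The function name, the threshold values, and the toy events are hypothetical, and the scalable algorithms mentioned in this article are not reproduced here.

```python
from itertools import combinations
from math import dist


def colocated_pairs(events, radius, min_prevalence):
    """Return {(type_a, type_b): prevalence} for pairs of event types.

    events is a list of (event_type, x, y) tuples. Two instances count as
    neighbors when they lie within `radius` of each other (an assumed
    neighborhood definition).
    """
    by_type = {}
    for idx, (etype, _x, _y) in enumerate(events):
        by_type.setdefault(etype, []).append(idx)

    result = {}
    for a, b in combinations(sorted(by_type), 2):
        near_a, near_b = set(), set()
        # Brute force: compare every instance of a with every instance of b.
        for i in by_type[a]:
            for j in by_type[b]:
                if dist(events[i][1:], events[j][1:]) <= radius:
                    near_a.add(i)
                    near_b.add(j)
        # Fraction of each type's instances that take part in the pattern;
        # the pair is only as prevalent as its weaker side.
        prevalence = min(len(near_a) / len(by_type[a]),
                         len(near_b) / len(by_type[b]))
        if prevalence >= min_prevalence:
            result[(a, b)] = prevalence
    return result


# Hypothetical toy map: every fire event has fuel nearby, ponds stand alone.
events = [
    ("fire", 0.0, 0.0), ("fuel", 0.5, 0.0),
    ("fire", 5.0, 5.0), ("fuel", 5.0, 5.4),
    ("fire", 9.0, 1.0), ("fuel", 9.3, 1.2),
    ("pond", 2.0, 8.0), ("pond", 7.0, 3.0),
]
patterns = colocated_pairs(events, radius=1.0, min_prevalence=0.6)
print(patterns)
```

With these toy events only the pair of fire and fuel passes the threshold. The nested loops compare every pair of instances, so the cost grows quadratically with the number of events, which is why scalable algorithms matter for large maps.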
Co-location analysis of animal and plant habitats, for example, may identify co-locations of predator and prey species, of symbiotic species, and of fire events with fuel and ignition sources. Readers may find it interesting to analyze the map in Figure 1 to find co-location patterns; there are two co-location patterns of size 2 in this map. Our group has provided one of the most natural formulations as well as the first algorithms for discovering co-location patterns in large spatial data sets, and has applied them to climatology data from NASA.

Spatial outliers are significantly different from their neighborhood even though they may not be significantly different from the entire population. For example, a brand-new house in an old neighborhood of a growing metropolitan area is a spatial outlier. Figure 2 shows another use of spatial outliers, in traffic measurements from sensors on I-35W (northbound) over a 24-hour period. Sensor 9 appears to be a spatial outlier and may be a bad sensor. Note that the figure also shows three clusters of sensor behavior: morning rush hour, evening rush hour, and busy daytime. Spatial statistics tests for detecting spatial outliers do not scale up to massive data sets such as the Twin Cities traffic data set, which is measured at thousands of locations at 30-second intervals and archived for years. We generalized spatial statistics tests to spatio-temporal data sets and developed scalable algorithms for detecting spatial outliers in massive traffic data sets.

Location prediction is concerned with discovering a model to infer the locations of a spatial phenomenon from maps of other spatial features. For example, ecologists build models to predict habitats for endangered species using maps of vegetation, water bodies, climate, and other related species. Figure 3 shows maps of nest locations, vegetation, and water used to build a location prediction model for red-winged blackbirds in the Darr and Stubble wetlands on the shores of Lake Erie in Ohio.
Classical data mining techniques yield weak prediction models because they do not capture the autocorrelation in spatial data sets. We provided a formal comparison of diverse techniques from spatial statistics (e.g. spatial autoregression) and from image classification (e.g. Markov random field based Bayesian classifiers), and developed scalable algorithms for them.

Courses on Scientific Databases (Csci 8705, Fall 2001), Data Mining (Csci seminar), and Spatial Biostatistics (PubH 8436) are wonderful opportunities to learn more about these topics. Websites archiving recent research publications on the topic include www.cs.umn.edu/shashi-group and db.cs.sfu.ca/GeoMiner/