Research Problems for Spatial Databases
This docuement has two sections.
Consider scrolling down to appropriate sub-section
if the entire document is not of interest.
First part describes three cateroies of projects, namely
Second part lists dozens of specific projects grouped in
- 1.1 statistical term-papers,
- 1.2 survey papers and
- 1.3 software implementation/demonstration projects.
- 1.4 other creative projects
- 2.1 Assorted Problems
- 2.2 Past, Present and Future of Database Research
- 2.3 Benchmarking
- 2.4 Data Models and Query Languages
- 2.5 Storage and Indexing
- 2.6 Query Optimization
- 2.7 Transactions, Recovery
- 2.8 Concurrency Control
- 2.9 Distributed and Parallel Databases
- 2.10 Data Warehouse
- 2.11 Spatial Networks
- 2.12 Data Mining
- 2.13 Rastor Data
- 2.14 Links to other collections of problems
1. Categories of Research Projects
There are three categories of projects, namely,
statistical term-papers, survey papers and software
General description of the three categories appear in this subsection.
Specific projects under each categories can be found from different sources.
My own research interests lie in the
area of spatial databases and spatial data mining. Some of the current
projects in this area are listed at
Problems of Current Interest.
Some of the open problems are listed at
in the second part of this document.
You are welcome to chose a project related to these problems
and interact with the spatial database research group.
Other faculty members with reesarch projects in database and data mining area
include Prof. Vipin Kumar, Prof. George Karypis,
Prof. John Carlis, Prof. Jaideep Srivastava, and
Prof. John Riedl. You are encouraged to visit them and browse their
web-pages to learn about their research projects. You may find
interesting ideas for course projects this way.
1.1 A Sample Statistical Literature Analysis Term-Paper
Statistical summary of the publication activity by topics and year
is often valuable to get the big-picture of research activities in
areas with vast literature.
Projects in this category will collect data about publications
and statistically summarize the publication activity
across different research topics (or validation methodologies used)
in spatial database forums in last 5 to 10 years.
Example papers reporting such results include
Walter F. Tichy, Paul Lukowicz, Lutz Prechelt, Ernst A. Heinz. Experimental Evaluation in Computer Science: A Quantitative Study. Journal of Systems and Software 28(1):9-18, January 1995. ,
Lutz Prechelt. Some Notes on Neural Learning Algorithm Benchmarking. Neurocomputing 9(3):343-347, December 1995. , etc.
Databases about published books (e.g.
and refereed journal/conference publications (e.g.
citeseer ) as
well as journal (e.g.
IEEE TKDE ,
GeoInformatica Journal ,
and conference proceedins (e.g.
IEEE DE Bulletin ,
are available on the web.
You can choose a sample subset of publication forms and collect data
via an internet crawler.
It may be helpful to prepare a summary chart, spatial visualizations
and other diagrams
to show the change in number of publications on each topic
by the year in conferences and journals.
Similar statistics on the methodology of choice would be useful.
You are welcome of think of informative statistical (and data
mining, knowledge discovery) tools and techniques to highlight trends.
The term paper should document the major results as well as the
data collection and analysis procedures.
An example of possible results is shown in Figure 3 in
C. H. Papadimitriou, Database Metatheory: Asking the Big Queries,
Proceeding of 1995 Conference on Principles of Database Systems, ACM SIGMOD
(reproduced on pages 656 of 3rd edition of Readings in Database Systems,
Edited by M. Stonebraker and J. M. Hellerstein, Morgan Kauffman,
More general analysis related to cross-citations include
Impact factor (essay) ,
and journal ratings .
1.2 Survey Papers
Survey the publications within a specific research topics
within database forums in last 5 to 10 years. Sample topics
include topics within spatial databases such as conceptual modelling of
spatial data, indexing and querying collections of moving objects,
vector map compression, spatial data mining, spatio-temporal databases,
mobile and wireless spatial databases, spatial data warehouses,
internet based spatial databases, semantic web, homeland security,
using spatial indexes for content based
You may find more topics from the call for papers (see Symposium on Spatial
Databases, or ACM Workshop on GIS, UCGIS Research Agenda
websites) from latest conferences on databases.
You may consider updating a recent survey paper. This will reduce
your literature survey work to the publications since the selected
survey paper was prepared (usually a year before publication).
UCGIS has a set of short position papers on emerging topics in spatial
databases and Geographhic Information Systems.
Extensive sources for survey papers include ACM Computing Surveys,
IEEE Computer (e.g. Embedded Databases survey in 9/2000 issue),
and Communications of the ACM.
A recent issue of IEEE Transactions on Knowledge and Data Eng. (January 1999)
also featured a number of survey papers on spatial database topics.
Conferences often feature tutorials on special topics and the notes (if
available) can be useful sources.
Another worhtwhile project is to create/update bibliography for papers in
spatial databases, spatial data mining and related topics.
The term paper should summarize major accomplishments and
next challanges. The format of the survey paper may resemble
those used in survey papers presented in prestigious computer science
journals mentioned above.
1.3 Sample Software Demo Projects
The research aspects of software demo project may be structured by
addressing the following questions:
- Functional Requirement analysis: Describe typical usage scenarios.
Identify the information and
computation needs to support the major activities. (Are some of these
geographic in nature?)
- Non-functional requirement analysis:
Specify performance needs in areas like response time and throughput.
Identify constraints posed by hardware and software platforms (e.g.
Personal Digital Assitants or PDAs) if relevant.
- Propose a benchmark for designing and evaluating the software
applications by identifying the key datasets, data types,
computations and queries.
- Identify the key design issues in designing the software.
Select the a few most critical decisions.
- List alternative ways of resolving the top few design issues.
Restrict your choices to 2 to 3 alternative for each design issue.
- Compare the alternatives for one of the design issue.
Identify the comparison methodology. List preliminary results
characterizing the dominance zone of the alternatives.
- Develop a prototype to demonstrate your work and write
a short paper documenting the key findings.
1.4 Other Creative Projects
- Create a timeline of historic events in databases. Resources include the
websites about history of IBM.
- Develop a 12 month calendar around database concepts.
Dates may note historic events, facts, cartoons etc. about databases.
- Develop a set of laboratory exercise to supplement the
courses on spatial databases.
2. Research Problems in Spatial Databases
Csci 8705 (Spatial Databases)
in Fall 2001 and
Csci 8701 (Overview of Database Research)
in Fall 2000. These courses provided an opportunity to identify
research needs within spatial databases. Some of these
problems are listed below categorized into following topics:
- Assorted Problems
- Past, Present and Future of Database Research
- Data Models and Query Languages
- Storage and Indexing
- Query Optimization
- Transactions, Recovery
- Concurrency Control
- Distributed and Parallel Databases
- Data Warehouse
- Spatial Networks
- Data Mining
- Rastor Data
These problems are suitable for
course projects leading to MS and PhD thesis.
Interested researchers are encouraged to contact
me via email or during office hours.
2.1 Assorted Problems
Keyhole markup language (KML)
from Google Earth allows one to publish
geo-coded data on the web in forms of maps. One may use KML to publish
a few local datasets related to the university or Twincities.
A better project would be to develop a small set of labs which can be
carried out by students in Csci 8715.
Even better project
if to evaluate web-mapping capabilities of KML with respect to
Arc/IMS, and Java based mapping tools.
- Internet crawlers have been used for many novel application ranging
from search engines, comparison shopping, financial information aggregation,
etc. An interesting application withing the domain of publishing relates to
A specific opportunity is related to bibliography and citations aspects
of technical writing in computer science. Internet has several
bibliography databases (DBLP, citeseer, cora) with latex/bibtext citation
entries for publication for literature in Computer Science.
The goal of this project would be develop a tool similiar to "Spell Checker"
for bibliography entries in a technical paper. Given a partial
bibliography entry (e.g. some authors, part of title, publication forum),
planned software will bring up closest matches from the internet
bibliography databases to allow authors to choose from. Components of such a
system include a internet crawler to query the databases, html/xml parsing
to extract latex entries, text matching to determine best matches, etc.
Note: A recent Plan B project in the group develop crawlers to
collect publications from database conference to track trends, e.g. publication
activity for various sub-areas over last decade. Part of the system may be
reused for this project.
Line generalization or simplification
is a well studied problem in cartography. A linestring can be simplified
by choosing a subset of the points in the linestring or via chain coding.
Develop efficient algorithms for a related problem to simplify a set of given
line-strings in a map under the constraint of preserving topological
relationships among them. For example, consider a contour map showing
isolines for elevation. The goal is to simplify each of the isoline (i.e.
polygon) in the map. Topological relationships to be preserved include
inside, touch, etc. For example, if a isoline I1 was completely inside
another isoline I2 in the original map then the simplified(I1) should
inside simplified(I2). If a isoline I3 was touching another isoline I2
at exactly one point in the original map then the simplified(I3) should
be touching simplified(I4) at one point. Test your solutions with
a couple fo different types of maps, e.g. contour maps, road-maps, etc.
- Personal Information Managers
An interesting project relates to map based spatial database
for personal information management (PIM) support typical GIS applications
such as the following:
You may explore other aspects such as integration with a global
positioning device (GPS), wireless communication to a web srever for
dynamic loading of dataset, map compression etc.
- Browse a map of University campus
- Locate and orient user on the map via visible landmarks around her
- Locate specific buildings or departments in the University campus
- Locate facilities (e.g. parking) near specific points of interests.
- Request determination of a walking or driving route between two buildings on campus
- Request landmark based directions for walk from a building to another
- Map accuracy evaluation:
Evaluation of digital map data accuracy often starts with visual
comparisons with references, e.g. aerial or satellite imagery.
Goal of this project is to help visualize quality of a digital road map
(e.g. TIGER files, or GPS tracks).
Develop a prototype software to produce overlay of satellite imagery
from terraserver.com and given digital map data for a small geographic
area. A useful reading in this area is the report titled
GPS TIGER Accuracy Analysis Tools (GTAAT) Evaluation and Test Results
- Location sensors inside buildings:
Location sensor for a mobile computing device inside a building
is an emerging new technology.
It will be interesting to brain storm
on the impact of this technology on spatial databases.
IEEE Computer (August 2001) has three survey articles on this topic.
It has applications in location aware mobile computing,
security, assest tracking,
information protection, routing/navigation inside buildings,
and smart wheel chairs as well as smart canes for visually impaired.
It will supplement other location technologies such as GPS and
cell-phones which are not very effective inside many buildings.
Resource on location aware mobile computing include
Location based services - a database perspective ,
guess editor intro to special issue of IEEE computer ,
context aware mobile computing papers at citeseer ,
A survey of context aware mobile computing (Chen/Kotz,
Dartmouth TR 2000-381) ,
NAS workshop Geospatial Tech. and IT - location aware .
Other relevant readings on web include
location technology overview articles (
mobile positioning ,
java location services.com: (
wireless location industry
location interoperbility forum
mobile positioning.com : (
contracts awarded ,
esri whitepaper ,
Snaptrack : (
spatial location protocol (SLoP) ,
and security technology:(
security magazine ,
ncjrs report - school
- Location based security policies - specification and implementation:
Access control policies based on location and other context (e.g. time, activity) can solve many
security problems of current wired and wireless information systems. For example,
hackers from outside on's facility or region may easily be identified. However, new research is
needed to (i) specify location based policies, (ii) mechanism to enforce location based
policies, e.g. location authentication hardware and software, etc.
Consider develop methods to address location based security issues in policy specifications
and implementation mechanisms.
2.2 Past, Present and Future of Database Research
- Write a survey paper on an emerging subarea of databases,
e.g. spatial databases, spatial data mining, web databases, mobile
location based services
- Create spatial
visualizations for the collection of publications related to spatial
databases or spatial data mining or other topics. The project would
characterize of the goals for visualizing a collection of documents,
choose as well as compare of visualization modes.
Example goals are to assess the poupularity of topics,
interaction among topics, assess temporal trends, e.g. change
in popularity of a topic.
Candidate visualizations include
Examples of elevation maps for visualizing collection of
documents are available from cartia.com .
- elevation map: hills= cluster centers, height= cardinality
- Quad-tree space division : leaf = document, non-leaf= topic
- Subway maps: station= document(s), lines= topic
- XML is reinventing databases in many areas (e.g. query languages)
related to conceptual and logical databases.
Specific tag families (e.g. GML for spatial dataset) of XML are being
created for different data types. XML parsers (e.g. DOM, SAX) are becoming
widely available for manipulating XML data outside a database.
However, querying XML datasets is not scalable since entire datset
has to be parsed and processed even if a small part of the data is
relevant to the query at hand.
Explore incorporating physical database notions (e.g. indexing, clustering,
query processing) to process point, range, join queries on XML
datatypes. Design and perform experiments to evaluate the overheads
and usefulness of such information in XML datasets in context of a useful
- Mobile database (e.g. Sybase Anywher, Oracle lite, Synchrologic, ...)
are becoming important with the increase in popularity of PDAs and other
mobile computing platforms. Spatial applications (e.g. roadmaps, point
of interest, routing, proximity queries) are among the killer applications
in this domain. Explore the impact of mobile environment on spatial
Multi-copy undate and synchronization is a core problem in commercial mobile
databases. Existing products (e.g. Synchrologic) provide competent solutions
for files and relational table. GIS and CAD companies have also proposed
custom solutions. Evaluate these solutions for synchronizing spatial
datasets (e.g. roadmaps, point of interests). Are new solution needed in
- Develop a benchmark (e.g. Sequoia 2000) to evaluate database
implementations of some of the following:
Identify the components of the spatial DBMS being evaluated by the
- OGIS spatial data types and operations.
- Rastor / field data types and map algera operations
- Spatial graph computations (e.g. mapquest.com)
- Mobile/wireless spatial computations (e.g. vindigo.com)
- Design benchmarks, design experiments and perform
experimental studies to characterize the
dominance zone of one of the following components of a spatial DBMS:
- Spatial indexing methods,
- Strategies for range query processing,
- Strategies for processing nearest neighbor queries,
- Strategies for spatial join processing,
- Spatial clustering methods (space filling curves),
2.4 Data Models and Query Languages
- Summer 2007: Trajectories (e.g. GPS-tracks, cell-phone tracks, etc.) data-sets are
becoming available and have the potential of creating hour-by-hour
census for cities to facilitate understanding human behaviour in urban areas.
However, current data models (e.g. OGIS spatial data types) do not support
trajectories well. Challenges include the infinite-streaming nature of time-series
of measurements generated by location sensors such as GPS. In addition,
the measurements are neither precise nor perfect, e.g. GPS reports a mean,
a standard deviation and an error code. Design conceptual (e.g. a mathematically
rigourous construct with properties like completeness),
logical (e.g. a set of data types) and physical (e.g. storage and computation efficiency)
models to effectively represent trajectory datasets.
- Study the XML databases (e.g. data format and query language).
Design XML query language for supporting OGIS operators on
- Represent the following in XML Schema:
OGIS spatial data type hierarchy.
How would one model the inheritance (sub-types),
part-of relationships among OGIS geometry subclasses in XML ?
How will polymorphism for topological operations be modeled ?
- Represent the following in XML Schema: Entity Relationship diagrams.
How will one model relationships? integrity constraints?
- Represent the following in XML Schema:
pictogram (for OGIS spatial data types) enhanced entity relationship model.
How will pictograms (non text icons) be modelled?
- Develop a data model for the census demographics data
or TIGER files using UML/ERD and pictograms.
Compare the new data models with the current file format descriptions.
- Extend the pictograms (used in conjunction with Entity Relationship
the spatial data types and operation in OGIS to support
some of the following:
(a) Spatial graph computations (e.g. mapquest.com)
(b) Rastor / field data types and map algera operations
(c) Mobile/wireless spatial computations (e.g. vindigo.com)
- Develop a translator to generate logical object-relational schema
(SQL3 create table with OGIS data types, integrity constraints)
from a conceptual model (e.g. Entity Relationship model with
pictogram for spatial types).
- Design a set of data types and operators to model and query the following
- Direction (e.g. north, south, left, right, above)
- spatial network (e.g. roadmaps, utility networks)
- rastor data (e.g. satellite data)
- Learn Oracle spatial data cartridge. Write SQL3/OGIS expressions
for the spatial queries in supplementary textbook by Shekhar/Chawla.
Test the correctness of queries with a spatial data-set.
- Prepare a solution manual for all the problems in
the supplementaty textbook on spatial databases.
A broader project would be to prepare a few laboratory assignments
for spatial database courses.
2.5 Storage and Indexing
- Study the algorithms for finding nearest neighbor of a given
point using R-tree.
Design an I/O efficient algorithms to find the
nearest neighbor of a given point
for Grid File spatial access method.
Also design an algorithm for finding the reverse nearest neighbor.
- Characterize the access patterns, working set sizes of
spatial queries and query processing strategies.
Extend DBMIN to provide approvice appropriate buffering
strategies for spatial databases.
Compare your strategies with those used in the IEEE TKDE
paper on buffer management for R-trees.
- Study the indexing methods for mobile objects. Evaluate
the performance of alternative methods including classical spatial
indexing methods (e.g. R tree) to characterize the
2.6 Query Optimization
- Extend the idea of threading for the leafs of tree
based spatial access methods (e.g. R-tree) by linking nearest neighbors
in some (e.g. 4 or 6) or all directions.
Threads may improve the I/O efficiency of range queries,
nearest neighbor queries but increase overhead of updates.
Design new bottom-up algorithms for range queries and nearest
neighbor queries taking advantage of the threads.
Also extend the update algorithms to maintain the threads.
Consider B-link and R-link tree algorithms to get a start.
- Develop cost models to characterize approximate
dominance zone of the items listed in previous question.
- Study ICDE 2001 paper on caching techniques for B-tree (Graefe).
Evaluate the proposed techniques for improving the CPU-cache
performance for R-tree. Identify needed changes (e.g. is searching MOBRs
different from searching bitstreams?)
Design an CPU-cache efficient implementation of R-tree to
minimize cache misses.
- Consider using B+tree with a space filling curve to support spatial
query processing (e.g. Sequoia 2000 benchmark). A DBMS company hopes to
save cost of development of new access methods (e.g. R* tree) along with
the associated concurrency control, recovery, performance tuning
Develop the following towards this goal:
Develop processing strategies for point query, range query, nearest
neighbor, and directional queries over collections of points.
Develop processing strategies for point query, range query and nearest
neighbor query over collections of extended objects (MOBRS).
Develop processing strategies for spatial join with simple
join predicate e.g. overlap.
Compare B+tree with space filling curves with other other spatial
access methods (e.g. R* tree) for processing spatial queries.
Consider using Sequoia 2000 benchmark as the workload.
Develop cost models to characterize estimate the result sizes and
I/O costs for above strategies.
- More specific discussion follows in
citation (postscript file) .
2.7 Transactions, Recovery
- Evaluate the pros and cons of the two phase commit protocol for
mobile transactions (e.g. buying a air-line ticket) using a
personal information management (PIM) application on a hand-held
devices (e.g. PDA).
Unlike ATMs and desk-top computers, PDAs are connected to internet
and servers via unreliable wireless communication. These have no
disk and somewhat stable main memories (unless battery goes out).
Should there be a PIM database on the hand-held device?
2.8 Concurrency Control
Design an efficient locking protocol for consurrent operation
for Grid File spatial access method. Compare proposed approach with
the locking protocol for concurrent operations on
R-tree using a R-link data-structure.
2.9 Distributed and Parallel Databases
- Study Navathe's paper on grouping queries from mobile clients
to exploit common subexpressions and filter-and-refine strategies
to make database server scale up to a large number of
intermittantly connecting mobile clients.
- Limited bandwidth of many mobile networks make it difficult to
transfer spatial datasets. Explore COMPRESSION methods for transferring
spatial data (e.g. oad-map) from web servers to mobile clients (e.g. Palm
OS clients) in an efficient manner. Compression scheme should support
fast decoding since Palm OS clients have limited computational capability.
- Develop and evaluate parallel formulations of the following:
- Routing algorithms, e.g. Hierachical routing
- Spatial data mining algorithms,
e.g. SAR, MRF BC, co-location rules, S-outlier detection, etc.
- Mapcube algorithms
2.10 Data Warehouse
- Study the paper titled "Mapcubes: A Visualization Tool for Spatial
Data Warehouses" by C. T. Lu et al. Evaluate the syntax of mapcube
statement for its ability to specifiy visualization of different spatial
data warehouses. Explore efficient implementations of mapcube
implementation reusing existing software modeules (e.g. GIS, DBMSs).
For example, develop a translator to generate GIS commands, object-relational
queries (SQL3 select statements with cube operator, GIS commands)
from mapcube statements.
- Model census demographics data using a data warehouse.
What new constraints and requirements are placed this application ?
2.11 Spatial Networks
- Many routing applications require routes which do not violate the
turn restrictions (e.g. no left turn between 7am and 9am)
posted at road intersections. Design roadmap representations and
routing algorithms capable of honoring turn restrictions.
- Current hierarchical routing algorithms has brute enumeration of all
pairs of boundary nodes from fragments of source and destination.
Design new algorithms to reduce or eliminate the brute force enumeration.
- Nearest neighbor queries in road networks
is a core problem for location based services, e.g.
find K nearest gas stations for a vehicle on road network.
To simply the problem one may first solve it for a stationary
vehicle before examining the case of moving vehicles.
This problem has been explored in the literature
for euclidean distance based nearest neighbors but not
adequately for road network distance based nearest neighbors.
A related idea is the "Allocte" operation, which divides
a network given a set of servers into service areas.
The service areas provide quick solution to nearest (K = 1) neighbor problem.
It may be useful to think about extending "Allocate" operation or
single source routing algorithms (e.g. Dijktra's) to finding
One may also use classical eucliden distance nearest neighbor finding
algorithms (e.g. R-tree based ) as a filter step. After all
euclidean distance is a lower bound on the network based distance
between two points. Thus finding K-nearest euclidean neighbors can
provide alower bound on the distance to the Kth-nearest graph
distance neighbor. Maintaining a matrix of road-network distances
among service centers may help in deriving
an uppoer bound on distance to Kth nearest service to help
terminate the iterative search algorithms like Dijktra's.
Another useful idea is to take advantage of
(and refine) the indexing methods for graphs (e.g. CCAM or
super-graph structure used for hierarchical routing).
2.12 Data Mining
- Relational data is often mined for four common patterns,
e.g. associations, sequential associations, clusters, and classification.
Generalize the definitions of these patterns for the following data:
Example of generalizing patterns for spatial data
are described n more detail as separate projects later in this list.
- vector spatial data, e.g. point, line and polygon features
- rastor spatial data
- spatial network data, e.g. roadmaps
- graph data, e.g. E-R diagrams, control-flow graphs for programs
- mobile spatial objects, e.g. vehicle fleets, army units
- text data, e.g. homework assignments, published papers
- Consider a brain storming project to define interesting spatial patterns.
A pragmatic approach is to identify common patterns discussed in
literature on spatial analysis (e.g. hot spots, spatial trends),
spatial statistics (e.g. point process, lattice theory, geo-statistical
approaches, auto-correlation, interpolation),
geo-science (tele-connections, e.g. el-nino effects), and
data mining ( associations, classification, clustering).
Another approach is to examine the spatial cognitive concepts (e.g. location,
co-location, shape, direction, ...) in natural language to look for families
of patterns that be useful in different spatial applications.
An intersting project is to
catalog and classify spatial patterns using multiple approaches.
- Consider the extension of notion of association rules to
a notion of "co-location" in geo-spatial datasets.
Items are represented by types (e.g. McDonalds, Burger King, hospitals,
police stations, bars, ...).
Neighborhoods (e.g. circles of diameter D) represent transactions.
Neighborshoods are implicit in contrast with the traditional explicit
transactions (e.g. zipcodes). Number of neighborhoods is infinite in
contrast to the traditional finite number of transactions.
Co-location patterns is a pair consisting of a set ST of spatial types and
a set SI of interest measures. Example interest measures include
frequency, ratio of actual frequency to expected frequency assuming
uniform distribtion of type instance over space, join-selectivity,
spatial cross-correlation (for subsets of size 2) etc.
Which of the interest measures satisfy monotonicity properties typical of
support measure in traditional association rules?
Modify "apriori" algorithm for discovering traditional association rules
to discover co-locations from spatial datasets.
Yan et al. have proposed a non-tranaction based colocation miner
algorithm using a large number of spatial joins. It is relatively expensive.
Propose a scheme to decompose a spatial dataset into a set of transactions
and residual cut-edges. Design a algorithm using a hybrid of apriori and
colocation miner which has similar cost as apriori if cut-set size if very small.
- Spatial classification- Study the paper titled "Predicting Locations
Using Map Similarity (PLUMS) : A Framework for Spatial Data Mining" by
S. Chawla et. al (Proc. ACM SIGKDD 2000 Workshop on Multimedia Data Mining).
This paper explores spatial classfication problems related to rastor data sets.
Develop a problem formulation for location prediction or
spatial classification for vector data (e.g. point sets representing
locations of crime, police stations, bars, businessed, highways, etc.).
Other problems of interest include incorporation of spatial
auto-correlation in other classification models, e.g. decision trees,
- Spatial Outlier Detection- Develop efficient methods of detecting
outliers in spatial datasets based on
- difference between the values of attributes of an spatial objects and
the values of attributes in the neighborhood.
- hot spots, i.e. abnormal clusters of events such as cancer incidence
- changes in shapes of a spatial phenonmenon (e.g. normal variation in
shape and extent of hot water area in Pacific vs. el-nino or la nina.)
- Spatial Clustering- Clustering of spatial objects (e.g. elements
in a spatial framework, e.g. land parcels in an agricultural field)
have a couple of new issues. Nearest neighbor pairs may be constrained to
belong to common clusters. For example EM algorithm has been generalized
to NEM to observe this constraint.
Another interesting constraints relate to the cluster sizes, cluster
continuit in space etc.
You may consider generalizing other clustering algorithms (e.g. K-mean,
K-medoid, etc.) to observe neighborhood constraint as well as other constraints.
Secondly, statistical significance of clusters are important. In other words
we want to flag regions which are not clustered by chance (e.g. complete spatial
random process). For example, density of a clustered region should be significantly
different from the entire space. Consider generalizing density threshold based
clustering algorithms to efficiently detect regions with significant clustering of
Finally, one may look for "declustered" regions besides the "clustered" regions of space.
In other words regions with exceptionally low density may be of interest.
Measures of clustering go beyond density and include Moran's I, nearest neighbor index etc.
However, clustering algorithms in data mining rarely look at those measures.
Consider generalizing current clustering algorithms to efficiently work with
spatial clustering measures.
- Semi-supervised Learning- Clustering, association rules and outliers are often
considered unsupervised learning method, while classification is considered a supervised
method. Some researchers have proposed a hybrid of clustering and classification in terms
of semi-supervised learning where learning dataset is a mixture of labelled and unlabelled
samples. Semi-supervised learning improves performance of learned classifiers particularly
when the set of labelled dataset is very small. Consider generalizing the notion of
semi-supervised learning to spatial data mining.
- Analyzing spatio-temporal datasets from CAA :
Our group is collaborating with Political Scientist from
Concept Analysis Agency
on Forecasting Militarized Interstate Disputes (FORMID).
The dataset has several attributes for each pair of
country in the world for last several years. The goal
is to analyze the dataset to find well-known patterns
as well as new patterns.
General reading on the subject on web includes
1946-99 Dataset (Oslo, Uppsala) ,
Conflict Management and Peace Science Journal ,
Proc. Uppsala Euro conference 2001 , and
Euro 2001 - Singer Report on Correlates of War,
Euro 2001 - Brecke paper War data analysis
Last two references provide a number of interesting patterns in
the data. Spatial patterns include the most war prone countries,
country pairs, regions and continents.
Other patterns include the list of country-pair attributes most correlated
with militarized conflicts.
- Analyzing spatio-temporal datasets from NASA :
Our group is collaborating with Earth Scientists from NASA to study the
relationships between global climate and vegetation growth.
The dataset divides the surface of earch into 0.5 - 1 degree
lattitude/logitude grids and provide measurements for temperature,
pressure, precipitation, solar energy, vegetation growth etc. for
last several years. Earth scientists know several patterns, e.g. El-Nino
(warming of Pacific oceans near Peru) reduce vegetation growth in
some areas of the world and incerese vegetation growth in other areas.
It also impacts grape crops in California and affects wine production.
General reading on the subject on web includes
el nino resources ,
and effects on usa .
The challenge for data mining is to first discover some of the known
patterns to gain trust and confidence in the abilities of DM.
This may interest Earth Scientists to examine new patterns found by
data mining techniques for further analysis of statistical significance
and possible adoption by the scientific community.
There are several interesting spatial data mining problems.
First problem is to automate the postprocessing of patterns (e.g.
association rules using pixels as transactions, attribute events as item-types)
for spatial cohesiveness and explanation by other geographic features,
e.g. vegetation types, soil type, elevation etc.
Instead of manually producing maps for each association rules and
mathing those with maps of geographic features, one may consider automating
the post-processing. Better yet, one may bring postprocessing (e.g.
spatial cohesiveness, spatial similarity with regions of
other features such as vegetation type) into each iteration of association rule
finding algorithm (e.g.Aprioi). Beside saving manual labor of postprocessing,
it might provide additional filtering efficiencies if spatial measures
exhibit monotonicity. There is also an opportunity for incremental computation
of spatial measures across iterations of association rule finding algorithm.
Second opportunity relates to approahes to teleconnection discovery problem.
Teleconnection patterns in Earth Science identify pairs < OAi, LAi >,
where OAi is a spatially coherent ocean area,
LAi is a spatially coherent land area, and
changes in physical properties of OAi has been observed to
significantly affect vegetation growth (and climate) in LAi over time.
Example teleconnection is the affect of El-Nino (warming of areas of Pacific)
on climate in mid-northern-lattitudes (e.g. warmer winters in Minnesota).
Teleconnection discovery problem can be formulated using a bi-partite
graph where nodes represent land pixels and ocean pixels.
Weight(edges b/w a land pixel and an ocean pixel) = f(time-series
similarity between two pixels).
Weight(edges b/w two land pixels) = fL(time-series similarity, distance).
Weight(edges b/w two ocean pixels) = fO(time-series similarity, distance).
We find pairs, e.g. (a group L of land pixels, a group O of ocean pixels)
such that aggregate_similarity(time-series for land pixels in L, time-series
of ocean pixels in O) is above threshold.
Objective function may include correctness and computational efficiency.
Constraints include statistical significance of teleconnection.
Current solution approaches are based on two phases, i.e. clustering
land-pixels first using fL and clustering ocean-pixels using fO and then
find correlations among land-clusters and ocean-clusters.
Two phase solution can not guarantee optimality and one solutions are desired.
Alternate solution procedure can be based on partitioning of the bi-partite
graph which directly solves the problem.
- spatio-temporal searches on the web :
Have you ever felt the need to reorder the list of hits produced by web
search engines by recency or distance to your location? In general,
having spatio-temporal information may be useful to quickly identify the
items of interest in the vast number of matches found by search engines for
various key word. One way to address this problem is analyzing the
documents on the web to extract their temporal and spatial footprints,
which could be used in querying, ranking, mapping or indexing. A few
recent workshops in early 2000s have eexplored this topic and there is room
2.13 Rastor Data
- Duality of vector and rastor representation implies that one
should be able model and compute similar concepts in both models.
Explore this duality by designing rastor representations and alogrithms
for the following spatial applications:
- shortest path problem
- topological queries (e.g. inside, overlap, touch)
- directional queries (e.g. north, left )
- Review the design on terraserver.com to allow storage and query of
satellite imagery for entire earth. It uses relational database for
storing and querying spatial data. Propose and evaluate an alternate
design using spatial database concepts for this system.
- Consider a collection of rastor maps for the sea surface temperature
at different times. Device an outlier detection algorithm based on
- changes in shapes of a spatial phenonmenon (e.g. normal variation in
shape and extent of hot water area in Pacific vs. el-nino or la nina.)
2.14 Links to other collections of problems
10 projects (and one game): a research narrative
from gpster.net (July 23rd, 2007).