Research Problems for Spatial Databases

This docuement has two sections. Consider scrolling down to appropriate sub-section if the entire document is not of interest. First part describes three cateroies of projects, namely

1.1 statistical term-papers,
1.2 survey papers and
1.3 software implementation/demonstration projects.
1.4 other creative projects

Second part lists dozens of specific projects grouped in following topics:

2.1 Assorted Problems
2.2 Past, Present and Future of Database Research
2.3 Benchmarking
2.4 Data Models and Query Languages
2.5 Storage and Indexing
2.6 Query Optimization
2.7 Transactions, Recovery
2.8 Concurrency Control
2.9 Distributed and Parallel Databases
2.10 Data Warehouse
2.11 Spatial Networks
2.12 Data Mining
2.13 Rastor Data
2.14 Links to other collections of problems

1. Categories of Research Projects

There are three categories of projects, namely, statistical term-papers, survey papers and software implementation/demonstration projects. General description of the three categories appear in this subsection.

Specific projects under each categories can be found from different sources. My own research interests lie in the area of spatial databases and spatial data mining. Some of the current projects in this area are listed at Problems of Current Interest. Some of the open problems are listed at in the second part of this document. You are welcome to chose a project related to these problems and interact with the spatial database research group.

Other faculty members with reesarch projects in database and data mining area include Prof. Vipin Kumar, Prof. George Karypis, Prof. John Carlis, Prof. Jaideep Srivastava, and Prof. John Riedl. You are encouraged to visit them and browse their web-pages to learn about their research projects. You may find interesting ideas for course projects this way.

1.1 A Sample Statistical Literature Analysis Term-Paper

Statistical summary of the publication activity by topics and year is often valuable to get the big-picture of research activities in areas with vast literature. Projects in this category will collect data about publications and statistically summarize the publication activity across different research topics (or validation methodologies used) in spatial database forums in last 5 to 10 years. Example papers reporting such results include Walter F. Tichy, Paul Lukowicz, Lutz Prechelt, Ernst A. Heinz. Experimental Evaluation in Computer Science: A Quantitative Study. Journal of Systems and Software 28(1):9-18, January 1995. , Lutz Prechelt. Some Notes on Neural Learning Algorithm Benchmarking. Neurocomputing 9(3):343-347, December 1995. , etc. Databases about published books (e.g. amazon.com ) and refereed journal/conference publications (e.g. DBLP , citeseer ) as well as journal (e.g. IEEE TKDE , GeoInformatica Journal , IJGIS ) and conference proceedins (e.g. VLDB , SIGMOD , CIKM/ACMGIS , IEEE DE Bulletin , SSD ) are available on the web. You can choose a sample subset of publication forms and collect data via an internet crawler. It may be helpful to prepare a summary chart, spatial visualizations and other diagrams to show the change in number of publications on each topic by the year in conferences and journals. Similar statistics on the methodology of choice would be useful. You are welcome of think of informative statistical (and data mining, knowledge discovery) tools and techniques to highlight trends. The term paper should document the major results as well as the data collection and analysis procedures.

An example of possible results is shown in Figure 3 in C. H. Papadimitriou, Database Metatheory: Asking the Big Queries, Proceeding of 1995 Conference on Principles of Database Systems, ACM SIGMOD (reproduced on pages 656 of 3rd edition of Readings in Database Systems, Edited by M. Stonebraker and J. M. Hellerstein, Morgan Kauffman, ISBN 1-55860-523-1). More general analysis related to cross-citations include Impact factor (essay) , and journal ratings .

1.2 Survey Papers

Survey the publications within a specific research topics within database forums in last 5 to 10 years. Sample topics include topics within spatial databases such as conceptual modelling of spatial data, indexing and querying collections of moving objects, vector map compression, spatial data mining, spatio-temporal databases, mobile and wireless spatial databases, spatial data warehouses, internet based spatial databases, semantic web, homeland security, using spatial indexes for content based retrieval, etc. You may find more topics from the call for papers (see Symposium on Spatial Databases, or ACM Workshop on GIS, UCGIS Research Agenda websites) from latest conferences on databases. You may consider updating a recent survey paper. This will reduce your literature survey work to the publications since the selected survey paper was prepared (usually a year before publication). UCGIS has a set of short position papers on emerging topics in spatial databases and Geographhic Information Systems. Extensive sources for survey papers include ACM Computing Surveys, IEEE Computer (e.g. Embedded Databases survey in 9/2000 issue), and Communications of the ACM. A recent issue of IEEE Transactions on Knowledge and Data Eng. (January 1999) also featured a number of survey papers on spatial database topics. Conferences often feature tutorials on special topics and the notes (if available) can be useful sources. Another worhtwhile project is to create/update bibliography for papers in spatial databases, spatial data mining and related topics. The term paper should summarize major accomplishments and next challanges. The format of the survey paper may resemble those used in survey papers presented in prestigious computer science journals mentioned above.

1.3 Sample Software Demo Projects

The research aspects of software demo project may be structured by addressing the following questions:

Functional Requirement analysis: Describe typical usage scenarios. Identify the information and computation needs to support the major activities. (Are some of these geographic in nature?)
Non-functional requirement analysis: Specify performance needs in areas like response time and throughput. Identify constraints posed by hardware and software platforms (e.g. Personal Digital Assitants or PDAs) if relevant.
Propose a benchmark for designing and evaluating the software applications by identifying the key datasets, data types, computations and queries.
Identify the key design issues in designing the software. Select the a few most critical decisions.
List alternative ways of resolving the top few design issues. Restrict your choices to 2 to 3 alternative for each design issue.
Compare the alternatives for one of the design issue. Identify the comparison methodology. List preliminary results characterizing the dominance zone of the alternatives.
Develop a prototype to demonstrate your work and write a short paper documenting the key findings.

1.4 Other Creative Projects

Create a timeline of historic events in databases. Resources include the websites about history of IBM.
Develop a 12 month calendar around database concepts. Dates may note historic events, facts, cartoons etc. about databases.
Develop a set of laboratory exercise to supplement the courses on spatial databases.

2. Research Problems in Spatial Databases

I taught Csci 8705 (Spatial Databases) in Fall 2001 and Csci 8701 (Overview of Database Research) in Fall 2000. These courses provided an opportunity to identify research needs within spatial databases. Some of these problems are listed below categorized into following topics:

Assorted Problems
Past, Present and Future of Database Research
Benchmarking
Data Models and Query Languages
Storage and Indexing
Query Optimization
Transactions, Recovery
Concurrency Control
Distributed and Parallel Databases
Data Warehouse
Spatial Networks
Data Mining
Rastor Data

These problems are suitable for course projects leading to MS and PhD thesis. Interested researchers are encouraged to contact me via email or during office hours.

2.1 Assorted Problems

Keyhole markup language (KML) from Google Earth allows one to publish geo-coded data on the web in forms of maps. One may use KML to publish a few local datasets related to the university or Twincities. A better project would be to develop a small set of labs which can be carried out by students in Csci 8715. Even better project if to evaluate web-mapping capabilities of KML with respect to Mapserver , Arc/IMS, and Java based mapping tools.
Internet crawlers have been used for many novel application ranging from search engines, comparison shopping, financial information aggregation, etc. An interesting application withing the domain of publishing relates to citation databases. A specific opportunity is related to bibliography and citations aspects of technical writing in computer science. Internet has several bibliography databases (DBLP, citeseer, cora) with latex/bibtext citation entries for publication for literature in Computer Science.
The goal of this project would be develop a tool similiar to "Spell Checker" for bibliography entries in a technical paper. Given a partial bibliography entry (e.g. some authors, part of title, publication forum), planned software will bring up closest matches from the internet bibliography databases to allow authors to choose from. Components of such a system include a internet crawler to query the databases, html/xml parsing to extract latex entries, text matching to determine best matches, etc.
Note: A recent Plan B project in the group develop crawlers to collect publications from database conference to track trends, e.g. publication activity for various sub-areas over last decade. Part of the system may be reused for this project.
Line generalization or simplification , is a well studied problem in cartography. A linestring can be simplified by choosing a subset of the points in the linestring or via chain coding. Develop efficient algorithms for a related problem to simplify a set of given line-strings in a map under the constraint of preserving topological relationships among them. For example, consider a contour map showing isolines for elevation. The goal is to simplify each of the isoline (i.e. polygon) in the map. Topological relationships to be preserved include inside, touch, etc. For example, if a isoline I1 was completely inside another isoline I2 in the original map then the simplified(I1) should inside simplified(I2). If a isoline I3 was touching another isoline I2 at exactly one point in the original map then the simplified(I3) should be touching simplified(I4) at one point. Test your solutions with a couple fo different types of maps, e.g. contour maps, road-maps, etc.
Personal Information Managers An interesting project relates to map based spatial database for personal information management (PIM) support typical GIS applications such as the following:
- Browse a map of University campus
- Locate and orient user on the map via visible landmarks around her
- Locate specific buildings or departments in the University campus
- Locate facilities (e.g. parking) near specific points of interests.
- Request determination of a walking or driving route between two buildings on campus
- Request landmark based directions for walk from a building to another
You may explore other aspects such as integration with a global positioning device (GPS), wireless communication to a web srever for dynamic loading of dataset, map compression etc.
Map accuracy evaluation: Evaluation of digital map data accuracy often starts with visual comparisons with references, e.g. aerial or satellite imagery. Goal of this project is to help visualize quality of a digital road map (e.g. TIGER files, or GPS tracks). Develop a prototype software to produce overlay of satellite imagery from terraserver.com and given digital map data for a small geographic area. A useful reading in this area is the report titled GPS TIGER Accuracy Analysis Tools (GTAAT) Evaluation and Test Results .
Location sensors inside buildings: Location sensor for a mobile computing device inside a building is an emerging new technology. It will be interesting to brain storm on the impact of this technology on spatial databases. IEEE Computer (August 2001) has three survey articles on this topic. It has applications in location aware mobile computing, security, assest tracking, information protection, routing/navigation inside buildings, and smart wheel chairs as well as smart canes for visually impaired. It will supplement other location technologies such as GPS and cell-phones which are not very effective inside many buildings. Resource on location aware mobile computing include Location based services - a database perspective , guess editor intro to special issue of IEEE computer , context aware mobile computing papers at citeseer , A survey of context aware mobile computing (Chen/Kotz, Dartmouth TR 2000-381) , NAS workshop Geospatial Tech. and IT - location aware . Other relevant readings on web include location technology overview articles ( federal requirements, wireless callers, mobile positioning , ), java location services.com: ( wireless location industry association < /a>, location interoperbility forum , mobile positioning.com : ( home , contracts awarded , ), home , tutorial , esri whitepaper , resources , SignalSoft: ( home , products , ), Snaptrack : ( home , products , ), spatial location protocol (SLoP) , and security technology:( security magazine , ncjrs report - school security ,).
Location based security policies - specification and implementation: Access control policies based on location and other context (e.g. time, activity) can solve many security problems of current wired and wireless information systems. For example, hackers from outside on's facility or region may easily be identified. However, new research is needed to (i) specify location based policies, (ii) mechanism to enforce location based policies, e.g. location authentication hardware and software, etc. Consider develop methods to address location based security issues in policy specifications and implementation mechanisms.

2.2 Past, Present and Future of Database Research

Write a survey paper on an emerging subarea of databases, e.g. spatial databases, spatial data mining, web databases, mobile databases, location based services ,
Create spatial visualizations for the collection of publications related to spatial databases or spatial data mining or other topics. The project would characterize of the goals for visualizing a collection of documents, choose as well as compare of visualization modes. Example goals are to assess the poupularity of topics, interaction among topics, assess temporal trends, e.g. change in popularity of a topic. Candidate visualizations include
1. elevation map: hills= cluster centers, height= cardinality
2. Quad-tree space division : leaf = document, non-leaf= topic
3. Subway maps: station= document(s), lines= topic
Examples of elevation maps for visualizing collection of documents are available from cartia.com .
XML is reinventing databases in many areas (e.g. query languages) related to conceptual and logical databases. Specific tag families (e.g. GML for spatial dataset) of XML are being created for different data types. XML parsers (e.g. DOM, SAX) are becoming widely available for manipulating XML data outside a database. However, querying XML datasets is not scalable since entire datset has to be parsed and processed even if a small part of the data is relevant to the query at hand.
Explore incorporating physical database notions (e.g. indexing, clustering, query processing) to process point, range, join queries on XML datatypes. Design and perform experiments to evaluate the overheads and usefulness of such information in XML datasets in context of a useful application.
Mobile database (e.g. Sybase Anywher, Oracle lite, Synchrologic, ...) are becoming important with the increase in popularity of PDAs and other mobile computing platforms. Spatial applications (e.g. roadmaps, point of interest, routing, proximity queries) are among the killer applications in this domain. Explore the impact of mobile environment on spatial databases.
Multi-copy undate and synchronization is a core problem in commercial mobile databases. Existing products (e.g. Synchrologic) provide competent solutions for files and relational table. GIS and CAD companies have also proposed custom solutions. Evaluate these solutions for synchronizing spatial datasets (e.g. roadmaps, point of interests). Are new solution needed in this space?

2.3 Benchmarking

Develop a benchmark (e.g. Sequoia 2000) to evaluate database implementations of some of the following:
1. OGIS spatial data types and operations.
2. Rastor / field data types and map algera operations
3. Spatial graph computations (e.g. mapquest.com)
4. Mobile/wireless spatial computations (e.g. vindigo.com)
Identify the components of the spatial DBMS being evaluated by the proposed benchmarks.
Design benchmarks, design experiments and perform experimental studies to characterize the dominance zone of one of the following components of a spatial DBMS:
1. Spatial indexing methods,
2. Strategies for range query processing,
3. Strategies for processing nearest neighbor queries,
4. Strategies for spatial join processing,
5. Spatial clustering methods (space filling curves),

2.4 Data Models and Query Languages

Summer 2007: Trajectories (e.g. GPS-tracks, cell-phone tracks, etc.) data-sets are becoming available and have the potential of creating hour-by-hour census for cities to facilitate understanding human behaviour in urban areas. However, current data models (e.g. OGIS spatial data types) do not support trajectories well. Challenges include the infinite-streaming nature of time-series of measurements generated by location sensors such as GPS. In addition, the measurements are neither precise nor perfect, e.g. GPS reports a mean, a standard deviation and an error code. Design conceptual (e.g. a mathematically rigourous construct with properties like completeness), logical (e.g. a set of data types) and physical (e.g. storage and computation efficiency) models to effectively represent trajectory datasets.
Study the XML databases (e.g. data format and query language).
1. Represent the following in XML Schema: OGIS spatial data type hierarchy. How would one model the inheritance (sub-types), part-of relationships among OGIS geometry subclasses in XML ? How will polymorphism for topological operations be modeled ?
2. Represent the following in XML Schema: Entity Relationship diagrams. How will one model relationships? integrity constraints?
3. Represent the following in XML Schema: pictogram (for OGIS spatial data types) enhanced entity relationship model. How will pictograms (non text icons) be modelled?
Develop a data model for the census demographics data or TIGER files using UML/ERD and pictograms. Compare the new data models with the current file format descriptions.
Extend the pictograms (used in conjunction with Entity Relationship model) and/or the spatial data types and operation in OGIS to support some of the following: (a) Spatial graph computations (e.g. mapquest.com) (b) Rastor / field data types and map algera operations (c) Mobile/wireless spatial computations (e.g. vindigo.com)
Develop a translator to generate logical object-relational schema (SQL3 create table with OGIS data types, integrity constraints) from a conceptual model (e.g. Entity Relationship model with pictogram for spatial types).
Design a set of data types and operators to model and query the following spatial applications
1. Direction (e.g. north, south, left, right, above)
2. spatial network (e.g. roadmaps, utility networks)
3. rastor data (e.g. satellite data)
Learn Oracle spatial data cartridge. Write SQL3/OGIS expressions for the spatial queries in supplementary textbook by Shekhar/Chawla. Test the correctness of queries with a spatial data-set.
Prepare a solution manual for all the problems in the supplementaty textbook on spatial databases. A broader project would be to prepare a few laboratory assignments for spatial database courses.

2.5 Storage and Indexing

Study the algorithms for finding nearest neighbor of a given point using R-tree. Design an I/O efficient algorithms to find the nearest neighbor of a given point for Grid File spatial access method. Also design an algorithm for finding the reverse nearest neighbor.
Characterize the access patterns, working set sizes of spatial queries and query processing strategies. Extend DBMIN to provide approvice appropriate buffering strategies for spatial databases. Compare your strategies with those used in the IEEE TKDE paper on buffer management for R-trees.
Study the indexing methods for mobile objects. Evaluate the performance of alternative methods including classical spatial indexing methods (e.g. R tree) to characterize the dominance zones.

2.6 Query Optimization

Extend the idea of threading for the leafs of tree based spatial access methods (e.g. R-tree) by linking nearest neighbors in some (e.g. 4 or 6) or all directions. Threads may improve the I/O efficiency of range queries, nearest neighbor queries but increase overhead of updates. Design new bottom-up algorithms for range queries and nearest neighbor queries taking advantage of the threads. Also extend the update algorithms to maintain the threads. Consider B-link and R-link tree algorithms to get a start.
Develop cost models to characterize approximate dominance zone of the items listed in previous question.
Study ICDE 2001 paper on caching techniques for B-tree (Graefe). Evaluate the proposed techniques for improving the CPU-cache performance for R-tree. Identify needed changes (e.g. is searching MOBRs different from searching bitstreams?) Design an CPU-cache efficient implementation of R-tree to minimize cache misses.
Consider using B+tree with a space filling curve to support spatial query processing (e.g. Sequoia 2000 benchmark). A DBMS company hopes to save cost of development of new access methods (e.g. R* tree) along with the associated concurrency control, recovery, performance tuning modules. Develop the following towards this goal:
1. Develop processing strategies for point query, range query, nearest neighbor, and directional queries over collections of points.
2. Develop processing strategies for point query, range query and nearest neighbor query over collections of extended objects (MOBRS).
3. Develop processing strategies for spatial join with simple join predicate e.g. overlap.
4. Compare B+tree with space filling curves with other other spatial access methods (e.g. R* tree) for processing spatial queries. Consider using Sequoia 2000 benchmark as the workload.
5. Develop cost models to characterize estimate the result sizes and I/O costs for above strategies.
6. More specific discussion follows in questions (textfile) and questions citation (postscript file) .

2.7 Transactions, Recovery

Evaluate the pros and cons of the two phase commit protocol for mobile transactions (e.g. buying a air-line ticket) using a personal information management (PIM) application on a hand-held devices (e.g. PDA). Unlike ATMs and desk-top computers, PDAs are connected to internet and servers via unreliable wireless communication. These have no disk and somewhat stable main memories (unless battery goes out). Should there be a PIM database on the hand-held device?

2.8 Concurrency Control

Design an efficient locking protocol for consurrent operation for Grid File spatial access method. Compare proposed approach with the locking protocol for concurrent operations on R-tree using a R-link data-structure.

2.9 Distributed and Parallel Databases

Study Navathe's paper on grouping queries from mobile clients to exploit common subexpressions and filter-and-refine strategies to make database server scale up to a large number of intermittantly connecting mobile clients.
Limited bandwidth of many mobile networks make it difficult to transfer spatial datasets. Explore COMPRESSION methods for transferring spatial data (e.g. oad-map) from web servers to mobile clients (e.g. Palm OS clients) in an efficient manner. Compression scheme should support fast decoding since Palm OS clients have limited computational capability.
Develop and evaluate parallel formulations of the following:
1. Routing algorithms, e.g. Hierachical routing
2. Spatial data mining algorithms, e.g. SAR, MRF BC, co-location rules, S-outlier detection, etc.
3. Mapcube algorithms

2.10 Data Warehouse

Study the paper titled "Mapcubes: A Visualization Tool for Spatial Data Warehouses" by C. T. Lu et al. Evaluate the syntax of mapcube statement for its ability to specifiy visualization of different spatial data warehouses. Explore efficient implementations of mapcube implementation reusing existing software modeules (e.g. GIS, DBMSs). For example, develop a translator to generate GIS commands, object-relational queries (SQL3 select statements with cube operator, GIS commands) from mapcube statements.
Model census demographics data using a data warehouse. What new constraints and requirements are placed this application ?

2.11 Spatial Networks

Many routing applications require routes which do not violate the turn restrictions (e.g. no left turn between 7am and 9am) posted at road intersections. Design roadmap representations and routing algorithms capable of honoring turn restrictions.
Current hierarchical routing algorithms has brute enumeration of all pairs of boundary nodes from fragments of source and destination. Design new algorithms to reduce or eliminate the brute force enumeration.
Nearest neighbor queries in road networks is a core problem for location based services, e.g. find K nearest gas stations for a vehicle on road network. To simply the problem one may first solve it for a stationary vehicle before examining the case of moving vehicles. This problem has been explored in the literature for euclidean distance based nearest neighbors but not adequately for road network distance based nearest neighbors.
A related idea is the "Allocte" operation, which divides a network given a set of servers into service areas. The service areas provide quick solution to nearest (K = 1) neighbor problem. It may be useful to think about extending "Allocate" operation or single source routing algorithms (e.g. Dijktra's) to finding K-nearest neighbors. One may also use classical eucliden distance nearest neighbor finding algorithms (e.g. R-tree based ) as a filter step. After all euclidean distance is a lower bound on the network based distance between two points. Thus finding K-nearest euclidean neighbors can provide alower bound on the distance to the Kth-nearest graph distance neighbor. Maintaining a matrix of road-network distances among service centers may help in deriving an uppoer bound on distance to Kth nearest service to help terminate the iterative search algorithms like Dijktra's. Another useful idea is to take advantage of (and refine) the indexing methods for graphs (e.g. CCAM or super-graph structure used for hierarchical routing).

2.12 Data Mining

Relational data is often mined for four common patterns, e.g. associations, sequential associations, clusters, and classification. Generalize the definitions of these patterns for the following data:
1. vector spatial data, e.g. point, line and polygon features
2. rastor spatial data
3. spatial network data, e.g. roadmaps
4. graph data, e.g. E-R diagrams, control-flow graphs for programs
5. mobile spatial objects, e.g. vehicle fleets, army units
6. text data, e.g. homework assignments, published papers
Example of generalizing patterns for spatial data are described n more detail as separate projects later in this list.
Consider a brain storming project to define interesting spatial patterns. A pragmatic approach is to identify common patterns discussed in literature on spatial analysis (e.g. hot spots, spatial trends), spatial statistics (e.g. point process, lattice theory, geo-statistical approaches, auto-correlation, interpolation), geo-science (tele-connections, e.g. el-nino effects), and data mining ( associations, classification, clustering). Another approach is to examine the spatial cognitive concepts (e.g. location, co-location, shape, direction, ...) in natural language to look for families of patterns that be useful in different spatial applications. An intersting project is to catalog and classify spatial patterns using multiple approaches.
Consider the extension of notion of association rules to a notion of "co-location" in geo-spatial datasets. Items are represented by types (e.g. McDonalds, Burger King, hospitals, police stations, bars, ...). Neighborhoods (e.g. circles of diameter D) represent transactions. Neighborshoods are implicit in contrast with the traditional explicit transactions (e.g. zipcodes). Number of neighborhoods is infinite in contrast to the traditional finite number of transactions.
Co-location patterns is a pair consisting of a set ST of spatial types and a set SI of interest measures. Example interest measures include frequency, ratio of actual frequency to expected frequency assuming uniform distribtion of type instance over space, join-selectivity, spatial cross-correlation (for subsets of size 2) etc.
Which of the interest measures satisfy monotonicity properties typical of support measure in traditional association rules? Modify "apriori" algorithm for discovering traditional association rules to discover co-locations from spatial datasets.
Yan et al. have proposed a non-tranaction based colocation miner algorithm using a large number of spatial joins. It is relatively expensive. Propose a scheme to decompose a spatial dataset into a set of transactions and residual cut-edges. Design a algorithm using a hybrid of apriori and colocation miner which has similar cost as apriori if cut-set size if very small.
Spatial classification- Study the paper titled "Predicting Locations Using Map Similarity (PLUMS) : A Framework for Spatial Data Mining" by S. Chawla et. al (Proc. ACM SIGKDD 2000 Workshop on Multimedia Data Mining). This paper explores spatial classfication problems related to rastor data sets. Develop a problem formulation for location prediction or spatial classification for vector data (e.g. point sets representing locations of crime, police stations, bars, businessed, highways, etc.). Other problems of interest include incorporation of spatial auto-correlation in other classification models, e.g. decision trees, rules, etc.
Spatial Outlier Detection- Develop efficient methods of detecting outliers in spatial datasets based on
1. difference between the values of attributes of an spatial objects and the values of attributes in the neighborhood.
2. hot spots, i.e. abnormal clusters of events such as cancer incidence or crimes.
3. changes in shapes of a spatial phenonmenon (e.g. normal variation in shape and extent of hot water area in Pacific vs. el-nino or la nina.)
Spatial Clustering- Clustering of spatial objects (e.g. elements in a spatial framework, e.g. land parcels in an agricultural field) have a couple of new issues. Nearest neighbor pairs may be constrained to belong to common clusters. For example EM algorithm has been generalized to NEM to observe this constraint. Another interesting constraints relate to the cluster sizes, cluster continuit in space etc. You may consider generalizing other clustering algorithms (e.g. K-mean, K-medoid, etc.) to observe neighborhood constraint as well as other constraints.
Secondly, statistical significance of clusters are important. In other words we want to flag regions which are not clustered by chance (e.g. complete spatial random process). For example, density of a clustered region should be significantly different from the entire space. Consider generalizing density threshold based clustering algorithms to efficiently detect regions with significant clustering of Finally, one may look for "declustered" regions besides the "clustered" regions of space. In other words regions with exceptionally low density may be of interest.
Measures of clustering go beyond density and include Moran's I, nearest neighbor index etc. However, clustering algorithms in data mining rarely look at those measures. Consider generalizing current clustering algorithms to efficiently work with spatial clustering measures.
Semi-supervised Learning- Clustering, association rules and outliers are often considered unsupervised learning method, while classification is considered a supervised method. Some researchers have proposed a hybrid of clustering and classification in terms of semi-supervised learning where learning dataset is a mixture of labelled and unlabelled samples. Semi-supervised learning improves performance of learned classifiers particularly when the set of labelled dataset is very small. Consider generalizing the notion of semi-supervised learning to spatial data mining.
Analyzing spatio-temporal datasets from CAA : Our group is collaborating with Political Scientist from Concept Analysis Agency on Forecasting Militarized Interstate Disputes (FORMID). The dataset has several attributes for each pair of country in the world for last several years. The goal is to analyze the dataset to find well-known patterns as well as new patterns. General reading on the subject on web includes 1946-99 Dataset (Oslo, Uppsala) , Conflict Management and Peace Science Journal , Proc. Uppsala Euro conference 2001 , and aelected papers:( Euro 2001 - Singer Report on Correlates of War, Euro 2001 - Brecke paper War data analysis ). Last two references provide a number of interesting patterns in the data. Spatial patterns include the most war prone countries, country pairs, regions and continents. Other patterns include the list of country-pair attributes most correlated with militarized conflicts.
Analyzing spatio-temporal datasets from NASA : Our group is collaborating with Earth Scientists from NASA to study the relationships between global climate and vegetation growth. The dataset divides the surface of earch into 0.5 - 1 degree lattitude/logitude grids and provide measurements for temperature, pressure, precipitation, solar energy, vegetation growth etc. for last several years. Earth scientists know several patterns, e.g. El-Nino (warming of Pacific oceans near Peru) reduce vegetation growth in some areas of the world and incerese vegetation growth in other areas. It also impacts grape crops in California and affects wine production. General reading on the subject on web includes el nino resources , el nino tutorial , and effects on usa .
The challenge for data mining is to first discover some of the known patterns to gain trust and confidence in the abilities of DM. This may interest Earth Scientists to examine new patterns found by data mining techniques for further analysis of statistical significance and possible adoption by the scientific community.
There are several interesting spatial data mining problems. First problem is to automate the postprocessing of patterns (e.g. association rules using pixels as transactions, attribute events as item-types) for spatial cohesiveness and explanation by other geographic features, e.g. vegetation types, soil type, elevation etc. Instead of manually producing maps for each association rules and mathing those with maps of geographic features, one may consider automating the post-processing. Better yet, one may bring postprocessing (e.g. spatial cohesiveness, spatial similarity with regions of other features such as vegetation type) into each iteration of association rule finding algorithm (e.g.Aprioi). Beside saving manual labor of postprocessing, it might provide additional filtering efficiencies if spatial measures exhibit monotonicity. There is also an opportunity for incremental computation of spatial measures across iterations of association rule finding algorithm.
Second opportunity relates to approahes to teleconnection discovery problem. Teleconnection patterns in Earth Science identify pairs < OAi, LAi >, where OAi is a spatially coherent ocean area, LAi is a spatially coherent land area, and changes in physical properties of OAi has been observed to significantly affect vegetation growth (and climate) in LAi over time. Example teleconnection is the affect of El-Nino (warming of areas of Pacific) on climate in mid-northern-lattitudes (e.g. warmer winters in Minnesota). Teleconnection discovery problem can be formulated using a bi-partite graph where nodes represent land pixels and ocean pixels. Weight(edges b/w a land pixel and an ocean pixel) = f(time-series similarity between two pixels). Weight(edges b/w two land pixels) = fL(time-series similarity, distance). Weight(edges b/w two ocean pixels) = fO(time-series similarity, distance). We find pairs, e.g. (a group L of land pixels, a group O of ocean pixels) such that aggregate_similarity(time-series for land pixels in L, time-series of ocean pixels in O) is above threshold. Objective function may include correctness and computational efficiency. Constraints include statistical significance of teleconnection. Current solution approaches are based on two phases, i.e. clustering land-pixels first using fL and clustering ocean-pixels using fO and then find correlations among land-clusters and ocean-clusters. Two phase solution can not guarantee optimality and one solutions are desired. Alternate solution procedure can be based on partitioning of the bi-partite graph which directly solves the problem.
spatio-temporal searches on the web : Have you ever felt the need to reorder the list of hits produced by web search engines by recency or distance to your location? In general, having spatio-temporal information may be useful to quickly identify the items of interest in the vast number of matches found by search engines for various key word. One way to address this problem is analyzing the documents on the web to extract their temporal and spatial footprints, which could be used in querying, ranking, mapping or indexing. A few recent workshops in early 2000s have eexplored this topic and there is room to innovate.

2.13 Rastor Data

Duality of vector and rastor representation implies that one should be able model and compute similar concepts in both models. Explore this duality by designing rastor representations and alogrithms for the following spatial applications:
1. shortest path problem
2. topological queries (e.g. inside, overlap, touch)
3. directional queries (e.g. north, left )
Review the design on terraserver.com to allow storage and query of satellite imagery for entire earth. It uses relational database for storing and querying spatial data. Propose and evaluate an alternate design using spatial database concepts for this system.
Consider a collection of rastor maps for the sea surface temperature at different times. Device an outlier detection algorithm based on
1. changes in shapes of a spatial phenonmenon (e.g. normal variation in shape and extent of hot water area in Pacific vs. el-nino or la nina.)
2.14 Links to other collections of problems
1. 10 projects (and one game): a research narrative from gpster.net (July 23rd, 2007).