Specific projects under each categories can be found from different sources. My own research interests lie in the area of spatial databases and spatial data mining. Some of the current projects in this area are listed at Problems of Current Interest. Some of the open problems are listed at in the second part of this document. You are welcome to chose a project related to these problems and interact with the spatial database research group.
Other faculty members with reesarch projects in database and data mining area include Prof. Vipin Kumar, Prof. George Karypis, Prof. John Carlis, Prof. Jaideep Srivastava, and Prof. John Riedl. You are encouraged to visit them and browse their web-pages to learn about their research projects. You may find interesting ideas for course projects this way.
An example of possible results is shown in Figure 3 in C. H. Papadimitriou, Database Metatheory: Asking the Big Queries, Proceeding of 1995 Conference on Principles of Database Systems, ACM SIGMOD (reproduced on pages 656 of 3rd edition of Readings in Database Systems, Edited by M. Stonebraker and J. M. Hellerstein, Morgan Kauffman, ISBN 1-55860-523-1). More general analysis related to cross-citations include Impact factor (essay) , and journal ratings .
These problems are suitable for course projects leading to MS and PhD thesis. Interested researchers are encouraged to contact me via email or during office hours.
Multi-copy undate and synchronization is a core problem in commercial mobile databases. Existing products (e.g. Synchrologic) provide competent solutions for files and relational table. GIS and CAD companies have also proposed custom solutions. Evaluate these solutions for synchronizing spatial datasets (e.g. roadmaps, point of interests). Are new solution needed in this space?
A related idea is the "Allocte" operation, which divides a network given a set of servers into service areas. The service areas provide quick solution to nearest (K = 1) neighbor problem. It may be useful to think about extending "Allocate" operation or single source routing algorithms (e.g. Dijktra's) to finding K-nearest neighbors. One may also use classical eucliden distance nearest neighbor finding algorithms (e.g. R-tree based ) as a filter step. After all euclidean distance is a lower bound on the network based distance between two points. Thus finding K-nearest euclidean neighbors can provide alower bound on the distance to the Kth-nearest graph distance neighbor. Maintaining a matrix of road-network distances among service centers may help in deriving an uppoer bound on distance to Kth nearest service to help terminate the iterative search algorithms like Dijktra's. Another useful idea is to take advantage of (and refine) the indexing methods for graphs (e.g. CCAM or super-graph structure used for hierarchical routing).
Secondly, statistical significance of clusters are important. In other words we want to flag regions which are not clustered by chance (e.g. complete spatial random process). For example, density of a clustered region should be significantly different from the entire space. Consider generalizing density threshold based clustering algorithms to efficiently detect regions with significant clustering of Finally, one may look for "declustered" regions besides the "clustered" regions of space. In other words regions with exceptionally low density may be of interest.
Measures of clustering go beyond density and include Moran's I, nearest neighbor index etc. However, clustering algorithms in data mining rarely look at those measures. Consider generalizing current clustering algorithms to efficiently work with spatial clustering measures.
The challenge for data mining is to first discover some of the known patterns to gain trust and confidence in the abilities of DM. This may interest Earth Scientists to examine new patterns found by data mining techniques for further analysis of statistical significance and possible adoption by the scientific community.
There are several interesting spatial data mining problems. First problem is to automate the postprocessing of patterns (e.g. association rules using pixels as transactions, attribute events as item-types) for spatial cohesiveness and explanation by other geographic features, e.g. vegetation types, soil type, elevation etc. Instead of manually producing maps for each association rules and mathing those with maps of geographic features, one may consider automating the post-processing. Better yet, one may bring postprocessing (e.g. spatial cohesiveness, spatial similarity with regions of other features such as vegetation type) into each iteration of association rule finding algorithm (e.g.Aprioi). Beside saving manual labor of postprocessing, it might provide additional filtering efficiencies if spatial measures exhibit monotonicity. There is also an opportunity for incremental computation of spatial measures across iterations of association rule finding algorithm.
Second opportunity relates to approahes to teleconnection discovery problem. Teleconnection patterns in Earth Science identify pairs < OAi, LAi >, where OAi is a spatially coherent ocean area, LAi is a spatially coherent land area, and changes in physical properties of OAi has been observed to significantly affect vegetation growth (and climate) in LAi over time. Example teleconnection is the affect of El-Nino (warming of areas of Pacific) on climate in mid-northern-lattitudes (e.g. warmer winters in Minnesota). Teleconnection discovery problem can be formulated using a bi-partite graph where nodes represent land pixels and ocean pixels. Weight(edges b/w a land pixel and an ocean pixel) = f(time-series similarity between two pixels). Weight(edges b/w two land pixels) = fL(time-series similarity, distance). Weight(edges b/w two ocean pixels) = fO(time-series similarity, distance). We find pairs, e.g. (a group L of land pixels, a group O of ocean pixels) such that aggregate_similarity(time-series for land pixels in L, time-series of ocean pixels in O) is above threshold. Objective function may include correctness and computational efficiency. Constraints include statistical significance of teleconnection. Current solution approaches are based on two phases, i.e. clustering land-pixels first using fL and clustering ocean-pixels using fO and then find correlations among land-clusters and ocean-clusters. Two phase solution can not guarantee optimality and one solutions are desired. Alternate solution procedure can be based on partitioning of the bi-partite graph which directly solves the problem.