Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries

National Science Foundation Award Number: # 0551551 (March 15, 2006 - March 14, 2009)

Contact Information:

Vipin Kumar, PI
Michael Steinbach, Co-PI

List of Supported Students and Staff:

Today's connect-anytime, connect-anywhere digital society is fueling tremendous data growth, transforming the way business, science, and society function. Data sets in the terabyte range are not uncommon today and are expected to reach petabytes in the near future in many application domains in science, engineering, business, bioinformatics, and medicine. The complexity of the data is also increasing. For these reasons, there is a growing need for automated data analysis and mining to extract the required information and knowledge from these data sets. However, the computational complexity of data mining algorithms, combined with this deluge of data, creates an important challenge: without a significant leap forward in computing capabilities and technological innovation, the opportunity to harvest this wealth of data will be lost. In this work, we aim to take important first steps towards such a revolution in computing capabilities and to develop the underlying infrastructure that will allow other researchers to take up this challenge. In particular, our goal is to (a) develop a benchmarking suite that will be used to understand the bottlenecks in high-performance data mining and to guide the development of next-generation processors and (b) devise data mining kernels that can be executed efficiently on existing and future processors.

The data sets and kernel algorithms being developed by our group will be made available to the community at large via the NU-MineBench suite. Some of these algorithms and data sets are already being requested by other data mining researchers who would like to try these techniques on large climate data sets or to use the algorithms developed in our group to solve their own problems. Over the long term, much of the distribution of data sets and kernels will be done via NU-MineBench, which is maintained by our collaborators at Northwestern University. During summer 2006 we contributed a parallel implementation of Error-Tolerant Itemset routines.
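
To illustrate what "error-tolerant" means for an itemset, the following is a minimal sketch, not the contributed parallel implementation. It assumes one common formulation in which a transaction supports an itemset if it is missing at most a given fraction of the itemset's items; the function name `tolerant_support` and the parameter `max_missing_fraction` are hypothetical and chosen only for this example.

```python
from typing import Hashable, Iterable, Sequence, Set


def tolerant_support(
    transactions: Sequence[Set[Hashable]],
    itemset: Iterable[Hashable],
    max_missing_fraction: float,
) -> float:
    """Fraction of transactions that contain the itemset up to a tolerance.

    Assumption for illustration: a transaction counts as supporting the
    itemset if the number of itemset items it lacks is at most
    max_missing_fraction * len(itemset). With a tolerance of 0.0 this
    reduces to the ordinary (exact) support of the itemset.
    """
    items = frozenset(itemset)
    if not items:
        raise ValueError("itemset must contain at least one item")
    if not 0.0 <= max_missing_fraction <= 1.0:
        raise ValueError("max_missing_fraction must be between 0 and 1")
    if not transactions:
        return 0.0

    allowed_missing = max_missing_fraction * len(items)
    supporting = sum(
        1 for t in transactions if len(items - set(t)) <= allowed_missing
    )
    return supporting / len(transactions)


# Small noisy example: the second transaction lacks one of the four items.
example_transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b"},
    {"d"},
]
exact = tolerant_support(example_transactions, {"a", "b", "c", "d"}, 0.0)
tolerant = tolerant_support(example_transactions, {"a", "b", "c", "d"}, 0.25)
```

In this toy data set, allowing one missing item out of four lets the second transaction count toward the support, which is the behavior that makes such patterns recoverable in noisy data.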

    PIs Kumar and Steinbach co-taught an introduction to data mining course at the University of Minnesota during Fall 2007. The course included several lectures on the applications of data mining to climate and bioinformatics, as well as on the importance of computationally efficient algorithms due to the scale of the data.
    Software, Metadata: The software implements a number of Error-Tolerant association mining algorithms. These algorithms are important for finding association patterns in noisy data, e.g., many types of biomedical data. This shareware is available with MineBench and from the following website: