Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries

National Science Foundation Award Number: # 0551551 (March 15, 2006 - March 14, 2009)

Contact Information:

Vipin Kumar, PI
Department of Computer Science and Engineering
4-192, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625-0726
E-mail: kumar at     URL:

Michael Steinbach, Co-PI
Department of Computer Science and Engineering
5-225 E, EE/CSci Building
University of Minnesota
Minneapolis, MN 55455
Phone (612) 625-7503
E-mail: steinbach at     URL:

List of Supported Students and Staff:

Graduate Students:
Undergraduate Students:

Project Award Information:

Project Summary:

Today's connect-anytime, connect-anywhere digital society is fueling tremendous data growth, transforming the way business, science, and society function. Data sets in the terabyte range are not uncommon today and are expected to reach petabytes in the near future for many application domains in science, engineering, business, bioinformatics, and medicine. The complexity of the data is also increasing. For these reasons, there is a growing need for automated data analysis and mining to extract the required information and knowledge from these data sets. However, the computational complexity of data mining algorithms, combined with this deluge of data, creates an important challenge: without a significant leap forward in computing capabilities and technological innovation, the opportunity to harvest this wealth of data will be lost. In this work, we aim to take important first steps towards such a revolution in computing capabilities and to develop the underlying infrastructure that will allow other researchers to take on this challenge. In particular, our goal is to (a) develop a benchmarking suite that will be used to understand the bottlenecks in high-performance data mining and to guide the development of next-generation processors, and (b) devise data mining kernels that can be executed efficiently on existing and future processors.

Duration: 3 years

Journal Publications:

  1. Van Ness, B; Ramos, C; Haznadar, M; Hoering, A; Haessler, J; Crowley, J; Jacobus, S; Oken, M; Rajkumar, V; Greipp, P; Barlogie, B; Durie, B; Katz, M; Atluri, G; Fang, G; Gupta, R; Steinbach, M; Kumar, V; Mushlin, R; Johnson, D; Morgan, G, Genomic variation in myeloma: design, content, and initial application of the Bank On A Cure SNP Panel to detect associations with progression-free survival, BMC Medicine, vol. 6 (2008). Published. DOI: 10.1186/1741-7015-6-2
  2. Varun Chandola, Arindam Banerjee, and Vipin Kumar, Anomaly Detection: A Survey, ACM Computing Surveys, Volume 41(3), July 2009. Tech Report

Books or Other One-time Publications:

  1. Rohit Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach, and Vipin Kumar, Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis, SIAM International Data Mining Conference (2007). Workshop proceedings published as CD. Published in collection: Petros Drineas, Vipin Kumar, Michael W. Mahoney, "Workshop on Biomedical Informatics"
  2. Gaurav Pandey, Michael Steinbach, Rohit Gupta, Tushar Garg, and Vipin Kumar, Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, ACM SIGKDD 2007, pp. 540-549.
  3. Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael Steinbach, and Vipin Kumar, Systematic Evaluation of Scaling Methods for Gene Expression Data, BIBM 2008, pp. 376-381, Philadelphia, PA, 3-5 Nov. 2008
  4. Boriah, S., Kumar, V., Steinbach, M., Potter, C., and Klooster, S., Land cover change detection: a case study, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008. DOI=
  5. Shyam Boriah, Varun Chandola, and Vipin Kumar, Similarity Measures for Categorical Data: A Comparative Evaluation, Proceedings of the SIAM International Conference on Data Mining, SDM 2008, pp. 243-254, April 24-26, 2008, Atlanta, Georgia
  6. Varun Chandola, Deepthi Cheboli, and Vipin Kumar, Detecting Anomalies in a Time Series Database, (2009). Technical Report, Computer Science Technical Report TR09-004
  7. Rohit Gupta, Gang Fang, Blayne Field, Michael Steinbach, and Vipin Kumar, Quantitative Evaluation of Approximate Frequent Pattern Mining Algorithms, (2009). Technical Report, TR 09-005

Research Contributions:

The data sets and kernel algorithms being developed by our group will become available to the community at large via the NU-MineBench suite. Some of these algorithms and data sets have already been requested by other data mining researchers who would like to try these techniques on large climate data sets or use the algorithms developed in our group to solve their own problems. Over the long term, much of the distribution of data sets and kernels will be done via NU-MineBench, which is maintained by our collaborators at Northwestern University. During summer 2006 we contributed a parallel implementation of Error-Tolerant Itemset routines.

Contributions to Resources for Research and Education:

    PIs Kumar and Steinbach co-taught an introduction to data mining course at the University of Minnesota during Fall 2007. The course included several lectures on applications of data mining to climate and bioinformatics, as well as on the importance of computationally efficient algorithms due to the scale of the data.
    Software, Metadata: The software implements a number of error-tolerant association mining algorithms. These algorithms are important for finding association patterns in noisy data, e.g., many types of biomedical data. This shareware is available with MineBench and from the following website: