Introduction to Data Mining (Second Edition)

Introduction to Data Mining


Pang-Ning Tan, Michigan State University
Michael Steinbach, University of Minnesota
Anuj Karpatne, University of Minnesota
Vipin Kumar, University of Minnesota


Preface to the Second Edition
Table of Contents
Webpage for First Edition (2005)

Link to Pearson Page of Book
 

Contact info: dmbook@cs.umn.edu

Highlights:

  • Provides both theoretical and practical coverage of all data mining topics.
  • Includes an extensive number of integrated examples and figures.
  • Offers instructor resources, including solutions for exercises and a complete set of lecture slides.
  • Assumes only a modest statistics or mathematics background; no database knowledge is needed.
  • Topics covered include classification, association analysis, clustering, anomaly detection, and avoiding false discoveries.
 

What is New in the Second Edition?

  • Avoiding False Discoveries: The second edition adds a completely new chapter on how to avoid false discoveries and produce valid results, a topic that is novel among contemporary textbooks on data mining. It supplements the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. Adding it recognizes the importance of this topic and acknowledges that those analyzing data need a deeper understanding of this area.

  • Classification: Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

  • Anomaly Detection: The anomaly detection chapter has been greatly revised and expanded. Existing approaches (statistical, nearest neighbor/density-based, and clustering-based) have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm.

  • Association Analysis: The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter).

  • Clustering: Changes to cluster analysis are also localized. The introductory chapter adds the K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering.

  • Data: The data chapter has been updated to include discussions of mutual information and kernel-based techniques.

  • Exploring Data: The data exploration chapter has been removed from the print edition of the book, but is available on the web.

  • Appendices: All appendices are available on the web. A new appendix provides a brief discussion of scalability in the context of big data.
 

Sample Chapters:


Resources for Instructors and Students:

Link to PowerPoint Slides

Links to Software and Tutorials

Errata (coming soon)

Solution Manual and Question Bank

Additional Resources


PowerPoint Slides:

  1. Introduction [PPT] [PDF] (last updated: 14 Feb, 2018).

  2. Data [PPT] [PDF] (last updated: 14 Feb, 2018).

  3. Classification: Basic Concepts and Techniques

    • Basic Concepts and Decision Trees [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Model Overfitting [PPT] [PDF] (last updated: 14 Feb, 2018).

  4. Classification: Alternative Techniques

    • Rule-based Classifier [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Nearest Neighbor Classifiers [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Naïve Bayes Classifier [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Artificial Neural Networks [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Support Vector Machine [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Ensemble Methods [PPT] [PDF] (last updated: 14 Feb, 2018).

    • Class Imbalance Problem [PPT] [PDF] (last updated: 14 Feb, 2018).

  5. Association Analysis: Basic Concepts and Algorithms [PPT] [PDF] (last updated: 14 Feb, 2018).

  6. Association Analysis: Advanced Concepts [PPT] [PDF] (last updated: 14 Feb, 2018).

  7. Cluster Analysis: Basic Concepts and Algorithms [PPT] [PDF] (last updated: 14 Feb, 2018).

  8. Cluster Analysis: Additional Issues and Algorithms [PPT] [PDF] (last updated: 14 Feb, 2018).

  9. Anomaly Detection [PPT] [PDF] (last updated: 14 Feb, 2018).

  10. Avoiding False Discoveries [PPT] [PDF] (last updated: 14 Feb, 2018).