Introduction to Data Mining (Second Edition)



Pang-Ning Tan, Michigan State University
Michael Steinbach, University of Minnesota
Anuj Karpatne, University of Minnesota
Vipin Kumar, University of Minnesota


Quick Links: Webpage for First Edition (2005)
Link to Pearson Page of Book, Link to electronic version of book
 

Contact info: dmbook@cs.umn.edu

Highlights:

  • Provides both theoretical and practical coverage of all data mining topics.
  • Includes an extensive number of integrated examples and figures.
  • Offers instructor resources, including solutions for exercises and a complete set of lecture slides.
  • Assumes only a modest statistics or mathematics background, and no database knowledge is needed.
  • Topics covered include classification, association analysis, clustering, anomaly detection, and avoiding false discoveries.
 

What is New in the Second Edition?

  • Avoiding False Discoveries: A completely new addition in the second edition is a chapter on how to avoid false discoveries and produce valid results, which is novel among contemporary textbooks on data mining. It supplements the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. Its addition recognizes the importance of this topic and acknowledges that those analyzing data need a deeper understanding of this area.

  • Classification: Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

  • Anomaly Detection: Anomaly detection has been greatly revised and expanded. Existing approaches (statistical, nearest neighbor/density-based, and clustering-based) have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm.

  • Association Analysis: The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter).

  • Clustering: Changes to cluster analysis are also localized. The introductory chapter adds the K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering.

  • Data: The data chapter has been updated to include discussions of mutual information and kernel-based techniques.

  • Exploring Data: The data exploration chapter has been removed from the print edition of the book, but is available on the web.

  • Appendices: All appendices are available on the web. A new appendix provides a brief discussion of scalability in the context of big data.
 

Sample Chapters:


Resources for Instructors and Students:

Link to PowerPoint Slides

Link to Figures as PowerPoint Slides

Links to Python Notebooks and Tutorials

Link to R Code Examples (Courtesy: Michael Hahsler)

Solution Manual and Question Bank

Errata

Additional Resources


PowerPoint Slides:

  1. Introduction [PPT] [PDF] (Update: 09 Sept, 2020).

  2. Data [PPT] [PDF] (Update: 27 Jan, 2021).

  3. Classification: Basic Concepts and Techniques

    • Basic Concepts and Decision Trees [PPT] [PDF] (Update: 01 Feb, 2021).

    • Model Overfitting [PPT] [PDF] (Update: 03 Feb, 2021).

  4. Classification: Alternative Techniques

    • Rule-based Classifier [PPT] [PDF] (Update: 30 Sept, 2020).

    • Nearest Neighbor Classifiers [PPT] [PDF] (Update: 10 Feb, 2021).

    • Naïve Bayes Classifier [PPT] [PDF] (Update: 08 Feb, 2021).

    • Artificial Neural Networks [PPT] [PDF] (Update: 22 Feb, 2021).

    • Support Vector Machine [PPT] [PDF] (Update: 17 Feb, 2020).

    • Ensemble Methods [PPT] [PDF] (Update: 11 Oct, 2021).

    • Class Imbalance Problem [PPT] [PDF] (Update: 15 Feb, 2021).

  5. Association Analysis: Basic Concepts and Algorithms [PPT] [PDF] (Update: 08 Mar, 2021).

  6. Association Analysis: Advanced Concepts [PPT] [PDF] (Update: 15 Mar, 2021).

  7. Cluster Analysis: Basic Concepts and Algorithms [PPT] [PDF] (Update: 24 Mar, 2021).

  8. Cluster Analysis: Additional Issues and Algorithms [PPT] [PDF] (Update: 31 Mar, 2021).

  9. Anomaly Detection [PPT] [PDF] (Update: 29 Nov, 2019).

  10. Avoiding False Discoveries [PPT] [PDF] (Update: 14 Feb, 2018).