Unsupervised Document Set Exploration Using Divisive Partitioning

Daniel Boley
Department of Computer Science and Engineering
University of Minnesota

Contact Information

4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: my last name at cs.umn.edu

WWW Page

Project URL: "http://www.cs.umn.edu/~boley/research/idm99.html" contains this short report. A set of papers on this algorithm as well as prototype software can be found in "http://www.cs.umn.edu/~boley/PDDP.html".

Project Award Information

Award Number: IIS-9811229.
Duration: September 15, 1996 to August 31, 2001.
Title: Unsupervised Document Set Exploration Using Divisive Partitioning.

List Of Supported Students And Staff (Optional)

  1. Daniel Boley, principal investigator
  2. Vivian Borst, graduate assistant

Keywords

Unsupervised clustering, data mining, classification, hierarchical taxomonies, Web agents.

Project Summary

This project is devoted to the research and development of a hierarchical divisive clustering algorithm. The basic underlying activity is the basic research needed to fully develop the divisive partitioning method and the applied research needed to demonstrate its use on specific applications and to make it easy for humans to use. The conceptual framework for the basic divisive partitioning method has been developed, and an efficient prototype implementation has been installed locally, on which all experimental results have been based. The research issues to be addressed in this project include completion of the theoretical foundations of the algorithm, experiments on larger datasets to measure the scalability of the methods, experiments on a datasets from a variety of sources as a way to gain understanding about the theory behind the method, extensions to the algorithm capabilities, extensions to other applications beyond text document clustering, and validation of algorithm results. Technology transfer forms an additional part of this project, and will be addressed by applying the methods developed as part of the project to other application datasets (e.g. genome, toxicology, medical literature datasets), as well as extending it to other paradigms such as hierarchical taxonomies, document rating aids and collaborative filtering. By incorporating it into a Web agent, the methods will be made available to a wide audience. All application datasets have either been collected by colleagues and will be shared as part of ongoing projects, or are available over the Web.

Goals, Objectives And Targeted Activities

The project goals are to develop the Principal Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a variety of applications.

The specific goals can be grouped into three main categories:
(1) demonstrate and test the applicability and effectiveness of the PDDP algorithm to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for automatically determining the number of clusters or automatically producing labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the algorithm, allowing users to apply it to a variety of datasets in a straightforward way.

The research plan included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.

Indication Of Success

Success is measured in the progress made within each task in the research plan. This project has been in progress for only 6 months during which we have actively pursued all 5 tasks mentioned above. In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to incorporate essential new features of the methods. The features are necessary to make the method a practical and useful method in the domain of Web exploration and include an automatic stopping test and a dynamic incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating this algorithm as a tool to automatically [i.e. without user participation] organize a large collection of Web pages visited by a user using a normal Web browser. The feasibility and practicality of the design has been demonstrated [Ref. 3], but robustness issues still need to be addressed.

Project Impact And Outcome

Human Resources: The project is currently supporting the research work of Vivian Borst working toward a Ph.D.
Education and Curriculum Development: The project is supporting development of the regular semester course CSci 5521 "Pattern Recognition" to be offered next year.
Infrastructure: The highest priority item purchased so far with the modest equipment budget has been a dedicated disk partition for large datasets. The second highest priority will be the purshase of a high performance computational engine for the computational experiments.
Industry: Not Applicable.
Activities Spawned: This activity is leading to the exploration of divisive partitioning methods for other applications taking place within the Dept, including collaborative filtering, genome mapping, vision-based defect detection. All of these are still in the preliminary stages.

GPRA Outcome Goals

  1. Discoveries at and across the frontier of science and engineering.
    We have developed a new paradigm for partitioning large datasets and the conceptual framework to incorporate new features. This framework has been made possible only by a synergy of a variety of different scientific disciplines, with the result that methods have new capabilities not possible by using methods from only one scientific discipline.

  2. Connections between discoveries and their use in service to society.
    The methods begin investigated here are of immediate and practical value to the analysis of large data sets from different applications: text documents, texture images, chemical toxicity parameters, etc. The automated analysis of large bodies of data from a variety of sources is a critical issue in the age of digital technology, and automated and unsupervised analysis tools are essential in this effort.

  3. A diverse, globally-oriented workforce of scientists and engineers.
    Even at the early stages of this project, we have promoted the education and incorporation of women in the workforce of scientists and engineers. The interdisciplinary nature of the project also encourages exposure and learning of concepts across many scientific disciplines, leading to a more well-rounded preparation to be members of the scientific workforce.

Project References

  1. D. Boley and V. Borst, Unsupervised Updating of a Classification Tree in a Dynamic Environment, in Autonomous Agents'99 Conference, 1999, to appear.
  2. D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Partitioning-Based Clustering for Web Document Categorization, Decision Support Systems, 1999, to appear. ("dss99.ps.gz").
  3. D. Boley, M. Gini, R. Gross, E.-H. (Sam) Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Document Categorization and Query Generation on the World Wide Web Using WebACE, AI Review, 1999, to appear. ("aireview.ps.gz").

Area Background

This project depends on the combination of several widely different fields which have come together to produce methods which are extremely successful. The foundation of the techniques -- the ideas that lead to the practicality and the scalability -- are methods for large sparse eigenvalue problems in numerical linear algebra. Specifically, the core computation is a so-called Lanczos-based method for computing the leading principal component for a large collection of sparse vectors each of which represents a document (or other data sample). The methods preserve the inherent sparsity making it possible to represent large data collections in relatively little space, and also making it possible to carry out the computations in a relatively short time. These underlying methods serve as a foundation for a variety of divisive partitioning techniques, which are analogous to the classical agglomerative clustering techniques. Instead of coalescing the individual data samples (documents) one by one into clusters (e.g. with a nearest neighbor criterion), we start with all the data samples in one big cluster and split it repeatedly along the principal component. The splitting process yields weights indicating which features are most critical in placing data samples in one cluster as opposed to another cluster, whose utility is currently underdeveloped.

Area References

  1. Golub and Van Loan, Matrix Computations, 3rd edition, Johns Hopkins Univ. Press, 1996.
  2. D. L. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2(4), 1999, to appear. ("PDDP.ps.gz").
  3. Gose, Johnsonbaugh, Jost, Pattern Recognition and Image Analysis, Prentice Hall,1996.

Potential Related Projects

To be added in future years, once greater familiarity with other projects is obtained.

The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.