Unsupervised Document Set Exploration Using Divisive Partitioning
Daniel Boley
Department of Computer Science and Engineering
University of Minnesota
Contact Information
4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: my last name at cs.umn.edu
WWW Page
Project URL:
"http://www.cs.umn.edu/~boley/research/idm99.html"
contains this short report. A set of papers on this algorithm as well as
prototype software can be found in
"http://www.cs.umn.edu/~boley/PDDP.html".
Project Award Information
Award Number: IIS-9811229.
Duration: September 15, 1996 to August 31, 2001.
Title: Unsupervised Document Set Exploration Using Divisive Partitioning.
List Of Supported Students And Staff (Optional)
- Daniel Boley, principal investigator
- Vivian Borst, graduate assistant
Keywords
Unsupervised clustering, data mining, classification, hierarchical
taxomonies, Web agents.
Project Summary
This project is devoted to the research and development of a hierarchical
divisive clustering algorithm. The basic underlying activity is the basic
research needed to fully develop the divisive partitioning method and the
applied research needed to demonstrate its use on specific applications and
to make it easy for humans to use. The conceptual framework for the basic
divisive partitioning method has been developed, and an efficient prototype
implementation has been installed locally, on which all experimental results
have been based.
The research issues to be addressed
in this project include completion of the theoretical foundations of the
algorithm, experiments on larger datasets to measure the
scalability of the methods, experiments on a datasets from a variety
of sources as a way to gain understanding about the theory behind the
method, extensions to the algorithm capabilities, extensions to other
applications beyond text document clustering, and validation of algorithm
results. Technology transfer forms an additional part of this project,
and will be addressed by applying the methods developed as part of the
project to other application datasets (e.g. genome, toxicology, medical
literature datasets), as well as extending it to other paradigms such as
hierarchical taxonomies, document rating aids and collaborative filtering.
By incorporating it into a Web agent, the methods will be made available
to a wide audience. All application datasets have either been collected by
colleagues and will be shared as part of ongoing projects, or are available
over the Web.
Goals, Objectives And Targeted Activities
The project goals are to develop the Principal
Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a
variety of applications.
The specific goals can be grouped into three main
categories:
(1) demonstrate and test the applicability and effectiveness
of the PDDP algorithm
to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for
automatically determining the number of clusters or automatically producing
labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the
algorithm, allowing users to apply it to a variety of datasets in a
straightforward way.
The research plan
included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.
Indication Of Success
Success is measured in the progress made within each task in the research
plan. This project has been in progress for only 6 months during which we
have actively pursued all 5 tasks mentioned above.
In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to
incorporate essential new features of the methods. The features are
necessary to make the method a practical and useful method in the domain of
Web exploration and include an automatic stopping test and a dynamic
incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating
this algorithm as a tool to automatically [i.e. without user participation]
organize a large collection of Web
pages visited by a user using a normal Web browser.
The feasibility and practicality of the design has been demonstrated
[Ref. 3], but robustness issues still need to be addressed.
Project Impact And Outcome
Human Resources:
The project is currently supporting the research work of Vivian Borst
working toward a Ph.D.
Education and Curriculum Development:
The project is supporting development of the regular semester course
CSci 5521 "Pattern Recognition" to be offered next year.
Infrastructure:
The highest priority item purchased so far with the modest equipment budget
has been a dedicated disk partition for large datasets. The second highest
priority will be the purshase of a high performance computational engine
for the computational experiments.
Industry:
Not Applicable.
Activities Spawned:
This activity is leading to the exploration of divisive partitioning methods
for other applications taking place within the Dept, including collaborative
filtering, genome mapping, vision-based defect detection. All of these are
still in the preliminary stages.
GPRA Outcome Goals
- Discoveries at and across the frontier of science and engineering.
We have developed a new paradigm for partitioning large datasets
and the conceptual framework to incorporate new features.
This framework has been made possible only by a synergy of a variety of
different scientific disciplines, with the result that methods have new
capabilities not possible by using methods from only one scientific
discipline.
- Connections between discoveries and their use in service to society.
The methods begin investigated here are of immediate and practical value to
the analysis of large data sets from different applications: text documents,
texture images, chemical toxicity parameters, etc. The automated analysis
of large bodies of data from a variety of sources is a critical issue in the
age of digital technology, and automated and unsupervised analysis tools are
essential in this effort.
- A diverse, globally-oriented workforce of scientists and engineers.
Even at the early stages of this project, we have promoted the education and
incorporation of women in the workforce of scientists and engineers.
The interdisciplinary nature of the project also encourages exposure and
learning of concepts across many scientific disciplines, leading to a more
well-rounded preparation to be members of the scientific workforce.
Project References
-
D. Boley and V. Borst, Unsupervised Updating of a Classification Tree in a Dynamic Environment, in Autonomous Agents'99 Conference, 1999, to appear.
-
D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Partitioning-Based Clustering for Web Document Categorization, Decision Support Systems, 1999, to appear.
("dss99.ps.gz").
-
D. Boley, M. Gini, R. Gross, E.-H. (Sam) Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Document Categorization and Query Generation on the World Wide Web Using WebACE, AI Review, 1999, to appear.
("aireview.ps.gz").
Area Background
This project depends on the combination of several widely different fields
which have come together to produce methods which are extremely successful.
The foundation of the techniques -- the ideas that lead to the practicality
and the scalability -- are methods for large sparse eigenvalue problems in
numerical linear algebra. Specifically, the core computation is a so-called
Lanczos-based method for computing the leading principal component for a
large collection of sparse vectors each of which represents a document (or
other data sample). The methods preserve the inherent sparsity making it
possible to represent large data collections in relatively little space, and
also making it possible to carry out the computations in a relatively short
time. These underlying methods serve as a foundation for a variety of divisive
partitioning techniques, which are analogous to the classical agglomerative
clustering techniques. Instead of coalescing the individual data samples
(documents) one by one into clusters (e.g. with a nearest neighbor
criterion), we start with all the data samples in one big cluster and split
it repeatedly along the principal component. The splitting process yields
weights indicating which features are most critical in placing data samples
in one cluster as opposed to another cluster, whose utility is currently
underdeveloped.
Area References
- Golub and Van Loan, Matrix Computations, 3rd
edition, Johns Hopkins Univ. Press, 1996.
-
D. L. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2(4), 1999, to appear.
("PDDP.ps.gz").
- Gose, Johnsonbaugh, Jost,
Pattern Recognition and Image Analysis,
Prentice Hall,1996.
Potential Related Projects
To be added in future years, once greater familiarity with other projects is
obtained.
The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.