Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering

J. Moore, E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher
7th Workshop on Information Technologies and Systems, 1997
Clustering techniques have been used by many intelligent software agents in order to
retrieve, filter, and categorize documents available on the World Wide Web. Clustering
is also useful in extracting salient features of related web documents to automatically
formulate queries and search for other similar documents on the Web. Traditional
clustering algorithms either use a priori knowledge of document structures to define a
distance or similarity among these documents, or use probabilistic techniques such as
Bayesian classification. Many of these traditional algorithms, however, falter when the
dimensionality of the feature space becomes high relative to the size of the document
space. In this paper, we introduce two new clustering algorithms that can effectively
cluster documents, even in the presence of a very high dimensional feature space. These
clustering techniques, which are based on generalizations of graph partitioning, do not
require pre-specified ad hoc distance functions, and are capable of automatically
discovering document similarities or associations. We conduct several experiments on real Web
data using various feature selection heuristics, and compare our clustering schemes to
standard distance-based techniques, such as hierarchical agglomerative clustering, and
Bayesian classification methods, such as AutoClass.
Research topics: Classification | Data mining