On efficiently summarizing categorical databases

Jianyong Wang and George Karypis
Knowledge and Information Systems, vol. 9, no. 1, 2006
Download Paper
Abstract
Frequent itemset mining was initially proposed and has been studied
extensively in the context of association rule mining. In recent years, several
studies have also extended its application to transaction or document clustering.
However, most of the frequent itemset based clustering algorithms need to first
mine a large intermediate set of frequent itemsets in order to identify a subset of
the most promising ones that can be used for clustering. In this paper, we study
how to directly find a subset of high quality frequent itemsets that can be used
as a concise summary of the transaction database and to cluster the categorical
data. By exploring key properties of the subset of itemsets that we are interested
in, we proposed several search space pruning methods and designed an efficient
algorithm called SUMMARY. Our empirical results show that SUMMARY
runs very fast even when the minimum support is extremely low and scales
very well with respect to the database size, and surprisingly, as a pure frequent
itemset mining algorithm it is very effective in clustering the categorical data and
summarizing the dense transaction databases.
Research topics: Clustering | Data mining