Scalabale Parallel Data Mining for Association Rules

Eui-Hong (Sam) Han, George Karypis and Vipin Kumar
ACM-SIGMOD Intl. Conference on Management of Data, pp. 277 - 288, 1997
Download Paper
Abstract
One of the important problems in data mining is discovering
association rules from databases of transactions where
each transaction consists of a set of items. The most time
consuming operation in this discovery process is the computation
of the frequency of the occurrences of interesting
subset of items (called candidates) in the database of transactions.
To prune the exponentially large space of candidates,
most existing algorithms, consider only those candidates
that have a user defined minimum support. Even with
the pruning, the task of finding all association rules requires
a lot of computation power and time. Parallel computers
offer a potential solution to the computation requirement
of this task, provided efficient and scalable parallel algorithms
can be designed. In this paper, we present two new
parallel algorithms for mining association rules. The Intelligent
Data Distribution algorithm efficiently uses aggregate
memory of the parallel computer by employing intelligent
candidate partitioning scheme and uses efficient communication
mechanism to move data among the processors. The
Hybrid Distribution algorithm further improves upon the InteUigent
Data Distribution algorithm by dynamically partitioning
the candidate set to maintain good load balance.
The experimental results on a Cray T3D parallel computer
show that the Hybrid Distribution algorithm scales linearly
and exploits the aggregate memory better and can generate
more association rules with a single scan of database per
pass.
Research topics: Data mining | Parallel processing | Pattern discovery