ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets

Mahesh Joshi, George Karypis, and Vipin Kumar
12th Intl. Parallel Processing Symposium, pp. 573-579, 1998
In this paper, we present ScalParC (Scalable Parallel Classifier), a new parallel
formulation of a decision-tree-based classification process. Like other state-of-the-art
decision tree classifiers such as SPRINT, ScalParC is suited for handling large datasets.
We show that the existing parallel formulation of SPRINT is unscalable, whereas ScalParC
is scalable in both runtime and memory requirements. We present
experimental results of classifying up to 6.4 million records on up to 128 processors of
a Cray T3D to demonstrate the scalable behavior of ScalParC. A key component
of ScalParC is the parallel hash table. The proposed parallel hashing paradigm can be
used to parallelize other algorithms that require many concurrent updates to a large
hash table.
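
The abstract does not describe how the parallel hash table works, so the following is only a minimal sketch of the general idea of many concurrent updates to a partitioned hash table, not the paper's implementation. It simulates the processors inside a single Python process. The class name `DistributedHashTable`, the block partitioning of integer keys, and the two-step "bucket by owner, then exchange" scheme are all assumptions made for illustration.

```python
from collections import defaultdict


class DistributedHashTable:
    """Single-process simulation of a hash table partitioned across owners.

    Hypothetical illustration: keys are integers in [0, num_keys) and each
    simulated processor owns one contiguous block of keys.
    """

    def __init__(self, num_procs, num_keys):
        self.num_procs = num_procs
        # Ceiling division, so every key falls into exactly one block.
        self.block = -(-num_keys // num_procs)
        # One local dictionary per simulated processor.
        self.local = [dict() for _ in range(num_procs)]

    def owner(self, key):
        """Return the index of the simulated processor that owns `key`."""
        return key // self.block

    def bulk_update(self, updates_per_proc):
        """Apply many updates at once.

        updates_per_proc[p] is the list of (key, value) pairs generated
        by simulated processor p.
        """
        # Step 1: each processor buckets its updates by destination owner.
        send = [defaultdict(list) for _ in range(self.num_procs)]
        for src, updates in enumerate(updates_per_proc):
            for key, value in updates:
                send[src][self.owner(key)].append((key, value))

        # Step 2: simulated all-to-all exchange. Each owner receives the
        # buckets addressed to it and applies them to its local part only.
        for dst in range(self.num_procs):
            for src in range(self.num_procs):
                for key, value in send[src].get(dst, []):
                    self.local[dst][key] = value

    def lookup(self, key):
        """Return the value stored for `key`, or None if it is absent."""
        return self.local[self.owner(key)].get(key)


# Usage: 4 simulated processors share a table of 16 keys.
table = DistributedHashTable(num_procs=4, num_keys=16)
table.bulk_update([
    [(0, "left"), (5, "right")],  # updates generated on processor 0
    [(15, "left")],               # updates generated on processor 1
    [],                           # processor 2 has nothing to send
    [(4, "right")],               # updates generated on processor 3
])
```

In a real message-passing setting, step 2 would be a collective all-to-all exchange rather than a nested loop. Batching updates by owner, as in this sketch, means each processor touches only its own part of the table, so no locking is needed.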
Research topics: Classification | Data mining | Parallel processing