Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering

Ying Zhao and George Karypis
Machine Learning, 55, pp. 311-331, 2004
Abstract
This paper evaluates the performance of different criterion functions in the context of partitional
clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three
of which are introduced in this paper and four of which were proposed previously. We present a comprehensive
experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various
criterion functions and their effect on the clusters they produce. Our experimental results show that a set
of criterion functions consistently outperforms the rest, and that some of the newly proposed criterion functions
lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions
depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii)
the degree to which they can lead to reasonably balanced clusters.
Research topics: Clustering | CLUTO | Data mining | Text mining