A Scalable Algorithm for Clustering Sequential Data

Valerie Guralnik and George Karypis
1st IEEE Conference on Data Mining, pp. 179 - 186, 2001
Download Paper
Many scientific and commercial domains have seen an enormous growth of data in recent years. Such data sets have inherent sequential nature. The clustering of such data is useful for various purposes. Over the years, many methods have been developed for clustering objects according to their similarity. However, in contexts of sequential data these methods tend to have a computational complexity that is at least quadratic on the number of sequences, as they require an all-against-all initial analysis. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity K -means based clustering algorithm. Our experimental evaluation in different domains show that this approach is not only scalable, but also leads to reasonably good clusters.
Research topics: Bioinformatics | Clustering | Data mining