A Scalable Algorithm for Clustering Protein Sequences

Valerie Guralnik and George Karypis
SIGKDD Workshop on Bioinformatics, BIOKDD, 2001
Download Paper
The enormous growth of public sequence databases and continuing addition of fully sequenced genomes has created many
challenges in developing novel and scalable computational
techniques for searching, comparing, and analyzing these
databases. Over the years, many methods have been developed for clustering proteins according to their sequence
similarity. However, most of these methods tend to have a
computational complexity that is at least quadratic on the
number of sequences. In this paper we present an entirely
different approach to protein clustering that does not require
an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experimental evaluation on three different data sets containing up to
43,569 protein sequences, show that this approach leads to
reasonably good clusters.
Research topics: Bioinformatics | Clustering | Data mining