A Scalable Algorithm for Clustering Protein Sequences
Valerie Guralnik and George Karypis |
SIGKDD Workshop on Bioinformatics, BIOKDD, 2001 |
Download Paper |
Abstract The enormous growth of public sequence databases and continuing addition of fully sequenced genomes has created many challenges in developing novel and scalable computational techniques for searching, comparing, and analyzing these databases. Over the years, many methods have been developed for clustering proteins according to their sequence similarity. However, most of these methods tend to have a computational complexity that is at least quadratic on the number of sequences. In this paper we present an entirely different approach to protein clustering that does not require an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experimental evaluation on three different data sets containing up to 43,569 protein sequences, show that this approach leads to reasonably good clusters. |
Research topics: Bioinformatics | Clustering | Data mining |