A CLUTO clustering scalability question

Hi everybody,

I'm looking at different clustering packages and was wondering about CLUTO's scalability.

I'm looking to cluster several million documents/instances (1-10 million) with several hundred thousand features (100,000-300,000).
Could CLUTO handle this load? If not, approximately what load could it handle?
How important is it for me to do some feature reduction in that respect?

Thank you so much!

RE: Cluto should be able to

CLUTO should be able to handle that (at least the rb-based partitional routines, which are the default options), as long as the data fits into memory.


Thanks for the quick reply :)

Two last questions:

1) Let's say I have a vector array of size X; how much memory is CLUTO's rb routine expected to use on top of that?

2) Is there a way to make one or more of CLUTO's clustering algorithms find the best number of clusters by themselves?


RE: Not quite sure how the array

Not quite sure how an array of size X applies to clustering, but if the dataset you are trying to cluster contains n objects and the total number of non-zeros over all n objects is m, then the memory complexity is about 4*(n+m) words, where a word is usually 4 bytes, depending on your architecture.
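To get a feel for what the 4*(n+m)-word estimate means at the scale from the original question, here is a small back-of-the-envelope sketch. The average number of non-zero features per document (100 below) is a made-up assumption for illustration; plug in your own sparsity.

```python
def cluto_memory_bytes(n, avg_nnz, word_bytes=4):
    """Estimate memory for CLUTO's rb partitional routines:
    roughly 4*(n + m) words, where m is the total non-zero count."""
    m = n * avg_nnz  # total non-zeros across all objects
    return 4 * (n + m) * word_bytes

# Hypothetical scale from the question: 1M and 10M documents,
# assuming ~100 non-zero features per document on average.
for n in (1_000_000, 10_000_000):
    gib = cluto_memory_bytes(n, avg_nnz=100) / 2**30
    print(f"n={n:>10,}: ~{gib:.1f} GiB")
```

So with these assumptions, 1M documents would need roughly 1.5 GiB and 10M documents roughly 15 GiB, before counting the input matrix itself.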

As far as (2) goes, the answer is no.
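Since CLUTO has no built-in model selection, a common external workaround is to run the clusterer at several values of k and compare a quality criterion such as the within-cluster sum of squared errors (the "elbow" heuristic). The sketch below uses a toy pure-Python k-means as a stand-in for the real clusterer, purely to illustrate the sweep; it is not CLUTO's algorithm.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of 2-D points.
    Returns (centers, SSE). Stand-in for a real clusterer."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute centers as cluster means (keep old center if empty).
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    sse = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points)
    return centers, sse

# Three well-separated toy blobs of 9 points each.
data = [(x + dx, y + dy)
        for (x, y) in [(0, 0), (10, 0), (5, 9)]
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

# Sweep k; the SSE drops sharply until the "true" k, then flattens.
sses = {k: kmeans(data, k)[1] for k in range(1, 6)}
for k, s in sses.items():
    print(f"k={k}: SSE={s:.1f}")
```

The same sweep idea applies with CLUTO: cluster the data at several values of k, record the criterion-function value reported for each run, and pick the k where further increases stop paying off.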

RE: thanks a lot for the quick

Thanks a lot for the quick and helpful answers :)
I'll check it out.