CLUTO failed to recognize well-separated globular clusters

As a simple test, I tried to cluster a 2D point set made up of 6 clusters. Each cluster contained ten points scattered randomly around its center, with the offset in each direction drawn from a normal distribution. The distances between the cluster centers and the width of the point distribution within each cluster were chosen so that the 6 clusters were visually non-overlapping. In spite of this, I could not find any combination of command-line options for "vcluto.exe" that recognized the clusters correctly. Please help me solve this problem.

Here is the input file, "data.mat" (the coordinates of the points of the 6 clusters are given in succession):

60 2
-0.727187 0.461223
-0.067149 0.042021
0.017421 -0.558830
-0.127384 -0.136258
-0.060875 -0.195622
-0.453923 0.030995
-0.337906 -0.066189
-0.244501 -0.083713
0.109984 -0.220099
-0.175832 -0.019360
0.566799 0.027544
1.183702 -0.242282
0.602949 -0.138401
0.801527 -0.421791
0.956166 -0.112359
1.074425 -0.141273
0.977010 0.525389
1.521451 0.225967
1.486592 0.019497
1.187931 -0.087829
0.524847 0.771784
0.729857 0.934038
1.171055 1.165033
0.598066 1.230799
0.758991 0.703215
0.703816 1.139694
0.666427 0.814383
0.800489 0.765239
0.877810 1.028471
0.513245 1.145659
2.828924 0.199996
2.550418 -0.417790
2.984896 -0.390169
3.165907 -0.181507
3.025049 -0.446569
3.473257 0.167563
2.900768 -0.083206
3.238546 -0.388105
2.764560 -0.266531
2.621064 -0.295956
3.478515 0.603132
2.775623 0.786382
3.291695 0.767752
3.082583 0.518551
3.598894 1.040041
3.679563 0.937952
3.544153 0.760760
3.469568 1.133655
2.709506 1.339515
3.508416 0.533573
2.492221 0.488501
2.166812 0.959811
2.725250 1.673108
2.650050 0.952934
2.344822 0.439185
2.332237 0.940061
2.273989 0.435294
2.777744 0.910597
2.425444 0.358104
2.455050 1.081782

Submitted by lkocsis on Fri, 2007-08-31 04:28

RE: Correct data

Sorry, I sent the wrong file in my previous message. Here is the correct one:

60 2
0.048550 0.099011
-0.000501 0.021890
-0.027622 0.026166
0.127645 0.121344
0.186340 -0.027467
-0.052256 -0.013313
0.010342 -0.127050
-0.080765 -0.166361
0.068044 -0.070355
-0.236459 0.028088
0.945879 -0.102898
0.866647 0.024309
1.107269 -0.125659
0.928791 -0.034718
0.998871 -0.094137
0.999918 -0.117456
0.975056 -0.102114
1.039658 -0.040167
0.973599 0.017367
0.833599 -0.011612
0.606412 0.864708
0.475461 0.807999
0.348246 1.079656
0.500973 0.840264
0.507137 0.725073
0.531654 1.043035
0.549983 0.898580
0.627808 0.754121
0.445218 0.928060
0.526081 0.993004
2.910396 0.010681
3.013518 0.184822
2.986096 -0.027511
2.883660 0.221255
3.118372 0.150853
2.998457 -0.194508
3.053622 -0.168054
2.928357 -0.057353
2.934444 -0.018582
3.031436 0.000893
3.583695 0.947174
3.427773 0.929660
3.427851 0.997033
3.479882 0.898735
3.497954 0.798726
3.527889 0.851093
3.605829 0.621124
3.562167 0.913354
3.324938 0.877720
3.569735 0.806915
2.434529 0.910116
2.391934 0.994119
2.495227 0.816252
2.537934 0.754154
2.466964 0.946790
2.450010 0.870145
2.496402 0.790405
2.482524 0.857112
2.404273 0.665140
2.629255 0.974417
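
For anyone who wants to reproduce the test, a data set with this layout can be generated roughly as follows. This is only a sketch: the cluster centers and the standard deviation of 0.1 are guesses read off the numbers above, not the values actually used, and the function names are made up for illustration. The output follows the same layout as the file above (a "rows columns" header line followed by one point per line).

```python
import random

def make_clusters(centers, points_per_cluster=10, sigma=0.1, seed=0):
    """Return points grouped cluster by cluster, each offset from its
    center by independent normal noise in x and y."""
    rng = random.Random(seed)
    return [
        (cx + rng.gauss(0, sigma), cy + rng.gauss(0, sigma))
        for cx, cy in centers
        for _ in range(points_per_cluster)
    ]

def to_dense_mat(points):
    """Format points as text: 'nrows ncols' header, then one row per point."""
    lines = [f"{len(points)} {len(points[0])}"]
    lines += [" ".join(f"{v:.6f}" for v in p) for p in points]
    return "\n".join(lines) + "\n"

# Assumed centers, estimated by eye from the posted coordinates.
centers = [(0, 0), (1, 0), (0.5, 0.9), (3, 0), (3.5, 0.9), (2.5, 0.9)]
text = to_dense_mat(make_clusters(centers))
```

Writing `text` to "data.mat" gives a file with the points of the 6 clusters in succession, as in the listing above.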

Submitted by lkocsis on Fri, 2007-08-31 04:35.

RE: You have to use the

You have to use the graph-based clustering for such a dataset, as it is the only method that supports a Euclidean-distance-based similarity function.

Submitted by karypis on Fri, 2007-08-31 05:00.

RE: GRAPH method also fails...

Thank you. The results are listed below (I edited the "*.clustering.*" output files in order to make the results easier to compare).

Correct cluster IDs:
0000000000-1111111111-2222222222-3333333333-4444444444-5555555555

vcluster data.mat 6 -clmethod=graph -sim=dist
1001100010-1111111111-2222122222-5535533333-4444444454-5555555555

vcluster data.mat 6 -clmethod=graph -sim=dist -agglofrom=10
0000000000-1011111110-2222022222-3333333333-4444444434-5553555553

These results are not really satisfying... I expected vcluster to recognize my well-separated spherical clusters exactly, since the above options worked very well for the example file "t4.mat", which seems much more difficult to cluster.
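
To put a number on the label strings above, the misassigned points can be counted with a short script. This is a sketch that assumes the cluster IDs have already been relabelled to line up with the correct IDs, as was done by hand for the listing above; the function name is made up for illustration.

```python
def count_mismatches(truth: str, result: str) -> int:
    """Count positions where two label strings disagree.

    The '-' separators between groups are ignored. The labels in both
    strings are assumed to be aligned already (same ID = same cluster).
    """
    t = truth.replace("-", "")
    r = result.replace("-", "")
    if len(t) != len(r):
        raise ValueError("label strings must have the same length")
    return sum(a != b for a, b in zip(t, r))

truth = "0000000000-1111111111-2222222222-3333333333-4444444444-5555555555"
run_default = "1001100010-1111111111-2222122222-5535533333-4444444454-5555555555"
run_agglo = "0000000000-1011111110-2222022222-3333333333-4444444434-5553555553"

print(count_mismatches(truth, run_default))
print(count_mismatches(truth, run_agglo))
```

For the two runs listed above this gives 10 and 6 misassigned points out of 60.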

I would like to mention that the hierarchical cluster analysis algorithm of STATISTICA 7.1 separated my clusters exactly, independently of the amalgamation rule (unweighted pair-group average, Ward's method, etc.). The reason I would like to use your software instead of STATISTICA is that I need to cluster lots of datasets from the command line.

Thank you for your help in advance.

Laszlo Kocsis, MD PhD
Hungarian Academy of Sciences

Submitted by lkocsis on Fri, 2007-08-31 06:31.

RE: The graph is too small for

The graph is too small for the default nnbrs (number of nearest neighbors) setting of the graph model.

Try the following:

vcluster data.mat 6 -clmethod=graph -sim=dist -nnbrs=10

vcluster data.mat 5 -clmethod=graph -sim=dist -nnbrs=10

(the 5 is because one of the clusters becomes a connected component in the resulting graph).
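
Since the goal mentioned earlier in the thread is to cluster many datasets from the command line, the command above could be wrapped in a small batch script. The sketch below only uses the options shown in this thread; the function names, the "*.mat" file pattern, and the assumption that "vcluster" is on the PATH are illustrative. Note that the right number of clusters may differ per dataset, as the 6 versus 5 case above shows.

```python
import subprocess
from pathlib import Path

def build_command(matrix_file, nclusters, nnbrs=10, exe="vcluster"):
    """Build the argument list for one graph-based vcluster run,
    using only the options that appear in this thread."""
    return [
        exe, str(matrix_file), str(nclusters),
        "-clmethod=graph", "-sim=dist", f"-nnbrs={nnbrs}",
    ]

def cluster_all(directory, nclusters, nnbrs=10):
    """Run vcluster on every *.mat file in a directory (assumed layout)."""
    for mat in sorted(Path(directory).glob("*.mat")):
        # check=True stops the batch if a run exits with an error code.
        subprocess.run(build_command(mat, nclusters, nnbrs), check=True)
```

Calling `cluster_all(".", 6)` would then run the first command above on each matrix file in the current directory.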

Submitted by karypis on Fri, 2007-08-31 09:44.

RE: The latter one worked. Thank

The latter one worked.
Thank you again.

One more question:
When do I have to expect that "the clusters become a connected component in the resulting graph"?

Submitted by lkocsis on Fri, 2007-08-31 10:41.

RE: when there are groups of

When there are groups of points that are far away from the others.
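
One way to check this for a given dataset before clustering is to build a nearest-neighbor graph yourself and count its connected components. The sketch below is an illustration only: it links two points when each is among the other's k nearest neighbors, which is an assumption and may differ from how CLUTO builds its graph internally. The function name and the toy points are made up.

```python
import math

def knn_components(points, k):
    """Count connected components of a mutual k-nearest-neighbor graph."""
    n = len(points)
    # For each point, the indices of its k nearest neighbors (not itself).
    nbrs = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        nbrs.append(set(order[:k]))

    # Union-find over points; merge i and j when they are mutual neighbors.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in nbrs[i]:
            if i in nbrs[j]:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})

# Two tight groups that are far apart from each other.
toy = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
       (10, 10), (10.1, 10), (10, 10.1), (10.1, 10.1)]
print(knn_components(toy, 3))
print(knn_components(toy, 7))
```

With k smaller than the group size, the two distant groups in the toy data never link to each other and form two components; with k large enough to reach every other point, the graph becomes a single component.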

Submitted by karypis on Fri, 2007-08-31 11:14.
