F-measure, entropy, and purity in the CLUTO tool?

Hi all,
I downloaded the CLUTO and gCLUTO tools, then created file.mat using Cygwin. My question is: how can I find out which files are in each cluster? I want to calculate the F-measure, entropy, and purity.

Many thanks in advance.

RE: Hi, Prof.--Are there any comments from you

Many thanks Prof.
I am waiting for your valuable comments.
Many thanks in advance.

RE: Thanks Prof. But how are the files ordered in the matrix

Many thanks, Prof., for your help.
I used the doc2mat command to convert all the text files into a .mat file (the input file for the CLUTO tools). All the files that I need to cluster are located in one folder. I know that each file is represented by one row in the matrix, but how can I know which file the first row represents, which file the second row represents, and so on? Are the files taken in the order in which they appear in the folder? For example, if the first file in my folder is B, the second is A, the third is X, and the fourth is D, is the order in the matrix B, A, X, then D?
I also know that in the clustering output each row represents one document, just as in the matrix, but I am confused about the file ordering. Is the order in the matrix the same as the order in the folder that contains the files? How can I know whether the file titled "D" is row 1 or row 2 in the matrix? Once I know that, it will be easy to work out which files are inside or outside each cluster, and then to compute the overall F-measure, overall purity, and overall entropy.

Many thanks to you, Prof.

RE: I am still waiting

Hi Prof.
I am still waiting. I would like to know how I can find out which document is in which cluster,
or how I can get the F-measure and purity for the clusters.
My regards

RE: Wael, Each line in the file


Each line in the file that you provided as input to the doc2mat script is a document. The output of doc2mat is a file containing the sparse matrix representation of all the documents, such that each row corresponds to a document. The ordering of those rows is the same as the ordering of the lines of text that you provided to doc2mat. When Cluto computes the clustering it writes out the clustering file, indicating the cluster # that each row of the matrix produced by doc2mat belongs to. Thus, to figure out the mapping from documents to clusters, you just need to go back to the file that you provided to doc2mat. If p[i] is the cluster number that the ith row of the matrix belongs to, then it means that the document corresponding to the ith line in the file that you provided as input to doc2mat belongs to cluster p[i].
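
The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the clustering file has one line per row with the cluster number as the first token on the line, and that you kept a list of document names in the same order as the lines you gave to doc2mat. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def map_rows_to_clusters(cluster_lines, row_names):
    """Group document names by cluster number.

    cluster_lines: lines of the clustering file (assumed: cluster number
                   is the first token on each line, one line per row).
    row_names:     one name per document, in the same order as the lines
                   of the file that was given to doc2mat.
    """
    labels = [int(line.split()[0]) for line in cluster_lines if line.strip()]
    if len(labels) != len(row_names):
        raise ValueError("clustering file and document list differ in length")
    clusters = defaultdict(list)
    for name, p in zip(row_names, labels):
        # p plays the role of p[i]: the cluster of the i-th row/document.
        clusters[p].append(name)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical example: four documents in the order B, A, X, D.
    names = ["B", "A", "X", "D"]
    clustering = ["1", "0", "1", "0"]
    for cluster_id, members in sorted(map_rows_to_clusters(clustering, names).items()):
        print(cluster_id, members)
```

In practice the two lists would be read from the clustering file and from whatever list of names you recorded when building the doc2mat input.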


RE: I got the HTML file, the same output that appears after clustering

Thanks, Prof.
I already have the Solution 1 .html file.
But I want to calculate the F-measure, entropy, and purity, which means I must know which files are in cluster one and which files are in cluster two.

My regards

RE: How did you create the .mat

How did you create the .mat file?

If you created it using the doc2mat script, then each row corresponds to a document. If the original file that you supplied to doc2mat has N rows, then the resulting clustering file that you get in the .html file will have N lines. There is a 1-1 correspondence between the line # in the .html file and the line # in the file that you provided to doc2mat.
If you did not use doc2mat, the correspondence between the .html file and the .mat file still holds.
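
Given that correspondence, purity, entropy, and F-measure can be computed from two parallel lists: the cluster number of each row and the known class of each row. The following is a minimal sketch, not CLUTO's own code. It assumes the commonly used definitions (per-cluster entropy normalized by the log of the number of classes and weighted by cluster size; F-measure taken per class as the best value over all clusters and weighted by class size), so the numbers CLUTO reports may be defined differently. The function name is hypothetical.

```python
import math
from collections import Counter


def external_quality(clusters, classes):
    """Return (purity, entropy, f_measure) for parallel per-row lists."""
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("need two non-empty lists of equal length")
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    joint = Counter(zip(clusters, classes))  # rows per (cluster, class) pair
    q = len(class_sizes)                     # number of classes

    purity = 0.0
    entropy = 0.0
    for r, n_r in cluster_sizes.items():
        counts = [joint[(r, c)] for c in class_sizes if joint[(r, c)] > 0]
        # Weighted purity: (n_r / n) * (max count / n_r) = max count / n.
        purity += max(counts) / n
        if q > 1:
            h = -sum((k / n_r) * math.log(k / n_r) for k in counts) / math.log(q)
            entropy += (n_r / n) * h

    f_measure = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for r, n_r in cluster_sizes.items():
            k = joint[(r, c)]
            if k:
                precision, recall = k / n_r, k / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        f_measure += (n_c / n) * best
    return purity, entropy, f_measure


if __name__ == "__main__":
    # Hypothetical data: six rows, two clusters, two classes.
    print(external_quality([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))
```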

PS: This is a moderated forum, i.e., I need to approve the messages before they are shown; so there is no need to post the same comment/message multiple times...

RE: the file name is not the same as the row

Hi Prof.
Many thanks to you.
I created the .mat file using the doc2mat script, so each row represents a document. I have a document named "USA.txt", and my data set contains many other files. How can I know whether this file is in cluster one or in another cluster? The .mat file includes only row numbers, not document names, so how can I know which document row 1 corresponds to? In other words, my documents have names, but after clustering, cluster one contains only row numbers.

My regards

RE: I used doc2mat script

Yes, I used the doc2mat script, running it in Cygwin under Windows. Could you please tell me what is meant by ISim, which appears in Solution 1?

RE: I suggest you take a look at

I suggest you take a look at the manual of Cluto as this is explained there.

RE: External Cluster Quality Statistics (Entropy and Purity)

Thank you very much, Prof.
Your suggestion really helped me a lot. I found that the External Cluster Quality Statistics already exist in CLUTO, which means there is no need to calculate them myself.
My question is: how can I create the file for -rclassfile=sports.rclass?

My data set consists of 4 classes, for example classes a, b, c, and d. How do I create the rclass file? Is there a script, like doc2mat (which I used to create the input matrix for CLUTO), for this?

Again, many thanks for your help.

My regards

RE: Look at the exporting section

Look at the exporting section of the manual: http://glaros.dtc.umn.edu/gkhome/files/fs/sw/gcluto/manual/index.html#3.4
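
For the command-line route, one way to keep document names and class labels aligned with the matrix rows is to generate the doc2mat input, a name list, and the class file together, in one pass. The sketch below is an illustration under assumptions, not an official script: it assumes a hypothetical folder layout with one sub-folder per class (a, b, c, d) holding .txt files, and it assumes, as I understand the CLUTO manual, that the file given to -rclassfile has one class label per line, with line i holding the class of row i. Please check the manual before relying on it. The function name and output file names are made up for this example.

```python
import tempfile
from pathlib import Path


def build_cluto_inputs(root, out_prefix):
    """Write <out_prefix>.txt, .names and .rclass with matching line order.

    .txt    one document per line (the input for doc2mat)
    .names  the original file name of each row
    .rclass the class label of each row (taken from the sub-folder name)
    """
    rows = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for doc in sorted(class_dir.glob("*.txt")):
            # Collapse all whitespace so the whole document fits on one line.
            # Note: an empty document would give an empty line; how doc2mat
            # treats such a line is not checked here.
            text = " ".join(doc.read_text(encoding="utf-8", errors="replace").split())
            rows.append((doc.name, class_dir.name, text))
    for suffix, column in ((".names", 0), (".rclass", 1), (".txt", 2)):
        lines = "".join(row[column] + "\n" for row in rows)
        Path(str(out_prefix) + suffix).write_text(lines, encoding="utf-8")
    return rows


if __name__ == "__main__":
    # Tiny self-contained demo with made-up documents.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "docs"
        for cls, name, body in [("a", "USA.txt", "first\ndocument"),
                                ("b", "UK.txt", "second document")]:
            (root / cls).mkdir(parents=True, exist_ok=True)
            (root / cls / name).write_text(body, encoding="utf-8")
        build_cluto_inputs(root, Path(tmp) / "corpus")
        print((Path(tmp) / "corpus.names").read_text(encoding="utf-8"))
        print((Path(tmp) / "corpus.rclass").read_text(encoding="utf-8"))
```

With files built this way, line i of the .names file tells you which document row i of the matrix is, which also answers the earlier question about finding "USA.txt" in the clustering output.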