How to use the dataset

I'm Hoang Thanh Lam, a Vietnamese student at Moscow State University, Russia. I'm going to apply particle swarm optimization to data clustering problems, and I found that your datasets are very useful for my project. However, I have some questions about how to use these datasets. I hope you can answer soon; thank you very much:
1/ The 15 datasets described in your article "Criterion Functions for Document Clustering: Experiments and Analysis" are quite different from the datasets downloaded from your website at http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz. For example, fbis is described in the article as having 12674 terms, while the file fbis.mat has only 2000 columns, i.e. only 2000 terms across all documents. Why do they differ?
2/ As I understand it, an element of the sparse matrix is a pair in which the first value is a column number (a term's index?) and the second is a floating-point value. You didn't say in detail what the second value is, but I guess it is the number of times the term appears in the document, isn't it? I need this information because I want to build a TF-IDF vector for each document (see the reading sketch after this post).
3/ Were all the documents preprocessed to remove common words and stemmed with Porter's stemming algorithm?

I'm looking forward to your answers. Thanks again!
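For anyone with the same question, here is a minimal sketch of how such a file could be read, assuming the usual CLUTO sparse layout (a header line giving the number of rows, columns, and nonzeros, followed by one line per document listing 1-based column-index/value pairs). The file name fbis.mat is just the example from the question, and the exact layout should be checked against the CLUTO manual.

from scipy.sparse import lil_matrix


def read_cluto_mat(path):
    """Read a CLUTO-style sparse matrix file.

    Assumed layout: a header line 'nrows ncols nnz', then one line per
    document containing 1-based 'column value' pairs.
    """
    with open(path) as f:
        nrows, ncols, _nnz = map(int, f.readline().split())
        mat = lil_matrix((nrows, ncols))
        for row in range(nrows):
            fields = f.readline().split()
            for col, val in zip(fields[0::2], fields[1::2]):
                mat[row, int(col) - 1] = float(val)  # convert to 0-based column index
    return mat.tocsr()


# Hypothetical usage: rows are documents, columns are terms.
# X = read_cluto_mat("fbis.mat")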

RE: Thanks

I have the same question: what exactly is the second value in each sparse-matrix pair? I also need it to build a TF-IDF vector for each document.
Samrx

RE: How to use dataset

Thank you.

RE: About datasets

Where can I get the original documents that were used to create these datasets?

RE: Thanks for the query

Thanks for the query. I need this information as well and am having the same problem. If there is a solution, please post it here; it would be much appreciated.

dogs

RE: I believe the problem with

I believe the problem with fbis is that, when the tar file was created, I included a pruned version of the dataset in which only the 2,000 most frequently occurring terms were kept.
As for the second value, it is just the term frequency.
And yes, all documents have had stop words removed and have been stemmed.

george
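
Following up on the answer above: since the stored values are raw term frequencies, a TF-IDF weighting could be computed roughly as below. This is only a sketch of one common TF-IDF variant (tf * log(N / df)); the exact weighting scheme is not specified in the thread.

import numpy as np
from scipy.sparse import csr_matrix


def tfidf(X):
    """Turn a documents-by-terms matrix of raw term frequencies into
    TF-IDF weights using tf * log(N / df)."""
    X = csr_matrix(X, dtype=np.float64, copy=True)
    n_docs = X.shape[0]
    # document frequency: number of documents each term occurs in
    df = np.bincount(X.indices, minlength=X.shape[1])
    idf = np.log(n_docs / np.maximum(df, 1))
    X.data *= idf[X.indices]  # scale each nonzero tf by its column's idf
    return X


# Hypothetical usage, combined with the reader sketched earlier in the thread:
# W = tfidf(read_cluto_mat("fbis.mat"))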