Re-building the data sets

Hello,

My name is Horatiu Mocian, and I am an MSc student at Imperial College London.

I am working on a clustering algorithm that uses suffix trees. I am planning to map the nodes of a suffix tree to the vector space model (VSM), and thus obtain new TF-IDF values.
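For reference, the TF-IDF weighting I have in mind is the textbook one, roughly as in the sketch below. The function name `tf_idf` and the toy documents are only illustrative, and I do not know whether the CLUTO datasets were built with this exact variant (raw term frequency times the natural log of N/df is just an assumption). In my case the "terms" would be suffix tree nodes rather than words.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (docs is a list of token lists)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        # Assumed variant: weight = tf * ln(N / df)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy example with three tiny "documents".
docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran", "ran"]]
weights = tf_idf(docs)
```

A term that occurs in every document gets weight 0 under this variant, which is part of why I would like to know exactly which documents and which weighting were used.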

I was wondering if there is a detailed description of how the CLUTO datasets were built that I could follow to obtain my own TF-IDF matrix. I am thinking of the "re0" and "re1" datasets specifically, since they are subsets of Reuters-21578. Can I find out which documents from that corpus were included in re0 and re1? Is it safe to assume that all the documents labeled with one of the 13 categories (or 25, respectively) have been included in the datasets?

Thank you,
Horatiu Mocian

Submitted by horatiu.mocian on Sun, 2009-07-19 03:17
