Re-building the data sets


My name is Horatiu Mocian, and I am an MSc student at Imperial College London.

I am working on a clustering algorithm that uses suffix trees. I am planning to map the nodes of a suffix tree to the vector space model (VSM), and thus obtain new TF-IDF values.
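
To make the target concrete, here is a minimal sketch of the kind of TF-IDF matrix I mean, assuming the plain tf * log(N / df) weighting over already-tokenized documents. The function name tfidf_matrix and the toy documents are only illustrative, and CLUTO's own preprocessing and weighting may well differ.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a dense TF-IDF matrix from a list of token lists.

    Assumption: weight = raw term frequency * log(N / document frequency).
    Returns (vocab, rows), where rows[i][j] is the weight of vocab[j] in docs[i].
    """
    n_docs = len(docs)
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        rows.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, rows


# Hypothetical toy corpus, not taken from Reuters.
docs = [["oil", "price", "oil"], ["grain", "price"], ["oil", "export"]]
vocab, rows = tfidf_matrix(docs)
```

In my case the columns would be suffix-tree nodes rather than single words, but the matrix shape would be the same.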

I was wondering whether there is a detailed description of how the CLUTO datasets were built that I could follow to obtain my own TF-IDF matrix. I am thinking specifically of the "re0" and "re1" datasets, since they are subsets of Reuters-21578. Can I find out which documents from that corpus were included in re0 and re1? Is it safe to assume that every document labeled with one of the 13 categories (or 25, respectively) has been included in the datasets?

Thank you,
Horatiu Mocian