First, my apologies for my lack of knowledge, as I am rather new to the area of NLP and clustering. I am looking to cluster a set of web pages based on their contents into a known (or, in fact, preferably unknown) number of clusters. After reading the introduction section of the CLUTO manual, the system appears to offer at least a good starting point. However, being new to the area, I am struggling to see the process involved in converting a number of web-page documents (i.e. raw HTML pages of approximately 2000 words each) into something that can be read into CLUTO so that it can cluster the documents. Would it therefore be possible to point me towards any resources, or even just the terminology I should search for, for this pre-processing step?
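
To make the question more concrete, here is a minimal sketch of what I imagine such a pre-processing step might look like, using only the Python standard library. It strips the HTML markup, tokenizes the remaining text, drops a few stop words, and builds a bag-of-words term-frequency matrix. Everything in it is an assumption on my part: the names `TextExtractor`, `html_to_tokens`, `build_sparse_matrix` and `to_sparse_text` are made up for illustration, the stop-word list is a tiny placeholder, and the output layout (a header line with rows, columns and non-zero count, then one line per document of 1-based `column value` pairs) is my reading of a CLUTO-style sparse matrix file and should be checked against the manual. Steps such as stemming and TF-IDF weighting are left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


# Tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}


def html_to_tokens(html):
    """Strip markup, lowercase, and return alphabetic tokens minus stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]


def build_sparse_matrix(pages):
    """Return (vocab, rows): vocab maps term -> 1-based column index,
    rows holds one {column: term frequency} dict per page."""
    vocab = {}
    rows = []
    for html in pages:
        counts = Counter(html_to_tokens(html))
        row = {}
        for term, freq in counts.items():
            col = vocab.setdefault(term, len(vocab) + 1)
            row[col] = freq
        rows.append(row)
    return vocab, rows


def to_sparse_text(vocab, rows):
    """Serialize as: 'nrows ncols nnz' header, then 'col val' pairs per row.
    This layout is an assumption; verify it against the CLUTO manual."""
    nnz = sum(len(r) for r in rows)
    lines = [f"{len(rows)} {len(vocab)} {nnz}"]
    for row in rows:
        lines.append(" ".join(f"{c} {v}" for c, v in sorted(row.items())))
    return "\n".join(lines) + "\n"
```

The terms I have come across so far for these steps are "tokenization", "stop-word removal", "stemming", "bag-of-words", "vector-space model" and "TF-IDF", but I am not sure whether these are the right things to be searching for.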

