A segment-based approach to clustering multi-topic documents

Andrea Tagarelli and George Karypis
Knowledge and Information Systems, September 2012
Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.
Research topics: Clustering | Data mining