Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization

George Karypis and Eui-Hong (Sam) Han
9th International Conference on Information and Knowledge Management (CIKM), pp. 12 - 19, 2000
Abstract
Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational and memory requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limit its applicability. In this paper we present a fast supervised dimensionality reduction algorithm that is derived from the recently developed cluster-based unsupervised dimensionality reduction algorithms. We experimentally evaluate the quality of the lower dimensional spaces both in the context of document categorization and improvements in retrieval performance on a variety of different document collections. Our experiments show that the lower dimensional spaces computed by our algorithm consistently improve the performance of traditional algorithms such as C4.5, k-nearest-neighbor, and Support Vector Machines (SVM), by an average of 2% to 7%. Furthermore, the supervised lower dimensional space greatly improves the retrieval performance when compared to LSI.
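
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way a cluster-based supervised reduction could work, not the authors' implementation. It assumes that each class is treated as one cluster, that the unit-length class centroids serve as the axes of the reduced space, and that a document's new coordinates are its cosine similarities to those centroids. The function names (`normalize`, `class_centroids`, `project`) and the toy data are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def class_centroids(docs, labels):
    """Return (sorted class names, one unit-length centroid per class).

    Assumption: each class label defines one cluster of documents.
    """
    classes = sorted(set(labels))
    centroids = []
    for c in classes:
        members = [normalize(d) for d, l in zip(docs, labels) if l == c]
        # Average the unit-length member vectors term by term.
        mean = [sum(col) / len(members) for col in zip(*members)]
        centroids.append(normalize(mean))
    return classes, centroids

def project(doc, centroids):
    """Map a document to its cosine similarity with each class centroid."""
    d = normalize(doc)
    return [sum(a * b for a, b in zip(d, c)) for c in centroids]

# Toy term-frequency vectors over a hypothetical 4-term vocabulary.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = ["sports", "sports", "finance", "finance"]

classes, centroids = class_centroids(docs, labels)
reduced = [project(d, centroids) for d in docs]  # 4 terms -> 2 dimensions
```

Under these assumptions the reduced space has one dimension per class, and building it takes a single pass over the training documents rather than a matrix decomposition. A classifier such as k-nearest-neighbor could then be run on `reduced` instead of on the original term vectors.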
Research topics: Classification | Data mining | Information retrieval | Text mining