Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization
George Karypis and Eui-Hong (Sam) Han
9th International Conference on Information and Knowledge Management (CIKM), pp. 12-19, 2000
Abstract: Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational and memory requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limit its applicability. In this paper we present a fast supervised dimensionality reduction algorithm that is derived from the recently developed cluster-based unsupervised dimensionality reduction algorithms. We experimentally evaluate the quality of the lower-dimensional spaces both in the context of document categorization and in terms of improvements in retrieval performance on a variety of different document collections. Our experiments show that the lower-dimensional spaces computed by our algorithm consistently improve the performance of traditional algorithms such as C4.5, k-nearest-neighbor, and Support Vector Machines (SVM), by an average of 2% to 7%. Furthermore, the supervised lower-dimensional space greatly improves the retrieval performance when compared to LSI.
Research topics: Classification | Data mining | Information retrieval | Text mining