Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

Eui-Hong (Sam) Han, George Karypis and Vipin Kumar
5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 53-65, 2001
Text categorization is the task of deciding whether a document belongs to a set of prespecified classes of documents. Automatic classification schemes can greatly facilitate the process of categorization. Categorization of documents is challenging, as the number of discriminating words can be very large. Many existing algorithms simply do not work with this many features. k-nearest neighbor (k-NN) classification is an instance-based learning algorithm that has been shown to be very effective for a variety of problem domains, including documents. The key element of this scheme is the availability of a similarity measure that is capable of identifying neighbors of a particular document. A major drawback of the similarity measure used in k-NN is that it uses all features in computing distances. In many document data sets, only a small fraction of the total vocabulary may be useful in categorizing documents. A possible approach to overcome this problem is to learn weights for different features (or words in document data sets). In this paper, we propose the Weight Adjusted k-Nearest Neighbor (WAKNN) classification algorithm, which is based on the k-NN classification paradigm. In WAKNN, the weights of features are learned using an iterative algorithm. In the weight adjustment step, the weight of each feature is perturbed in small steps to see if the change improves the classification objective function. The feature with the most improvement in the objective function is identified and the corresponding weight is updated. The feature weights are used in the similarity measure computation such that important features contribute more to the similarity measure. Experiments on several real-life document data sets show the promise of WAKNN, as it outperforms state-of-the-art classification algorithms such as C4.5, RIPPER, Rainbow, PEBLS, and VSM.
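The weight adjustment loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the objective function, or the perturbation schedule, so the weighted cosine similarity, the leave-one-out accuracy objective, the multiplicative factors, and all function names below are assumptions chosen for illustration. The factors are deliberately coarse so that the effect is visible on a four-document toy set.

```python
import math
from collections import Counter


def weighted_cosine(x, y, w):
    """Cosine similarity after scaling each feature by its weight (assumed measure)."""
    xs = [xi * wi for xi, wi in zip(x, w)]
    ys = [yi * wi for yi, wi in zip(y, w)]
    dot = sum(a * b for a, b in zip(xs, ys))
    nx = math.sqrt(sum(a * a for a in xs))
    ny = math.sqrt(sum(b * b for b in ys))
    return dot / (nx * ny) if nx and ny else 0.0


def knn_predict(query, docs, labels, w, k, skip=None):
    """Similarity-weighted vote among the k nearest documents (index `skip` excluded)."""
    sims = [(weighted_cosine(query, d, w), labels[i])
            for i, d in enumerate(docs) if i != skip]
    sims.sort(key=lambda s: s[0], reverse=True)
    votes = Counter()
    for sim, label in sims[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]


def objective(docs, labels, w, k):
    """Leave-one-out accuracy, a stand-in for the paper's objective function."""
    hits = sum(knn_predict(d, docs, labels, w, k, skip=i) == labels[i]
               for i, d in enumerate(docs))
    return hits / len(docs)


def adjust_weights(docs, labels, k=1, factors=(0.5, 2.0), max_iters=20):
    """Greedy search: per iteration, apply the single perturbation that helps most."""
    w = [1.0] * len(docs[0])
    best = objective(docs, labels, w, k)
    for _ in range(max_iters):
        best_move = None
        for j in range(len(w)):          # try perturbing every feature weight
            for f in factors:            # hypothetical perturbation factors
                trial = w[:j] + [w[j] * f] + w[j + 1:]
                score = objective(docs, labels, trial, k)
                if score > best:
                    best, best_move = score, (j, w[j] * f)
        if best_move is None:            # no perturbation improves the objective
            break
        j, value = best_move
        w[j] = value
    return w, best


# Toy term-frequency vectors: feature 0 marks class "a", feature 1 marks
# class "b", and feature 2 is a frequent word unrelated to either class.
docs = [[2, 0, 1], [2, 0, 5], [0, 2, 1], [0, 2, 5]]
labels = ["a", "a", "b", "b"]

baseline = objective(docs, labels, [1.0, 1.0, 1.0], k=1)
weights, score = adjust_weights(docs, labels, k=1)
print(baseline, weights, score)
```

With uniform weights, the uninformative third feature dominates the similarity for two of the four documents. The greedy search raises the weight of a discriminating feature instead, which is the behavior the abstract attributes to weight adjustment.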
Research topics: Classification | Data mining | Text mining