Frequent Sub-structure Based Approaches for Classifying Chemical Compounds

Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis
IEEE Trans. Knowl. Data Eng. 17(8): 1036-1050, 2005
Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases of the drug development process. These techniques are used to solve a number of classification problems, such as predicting whether a chemical compound has the desired biological activity, determining whether it is toxic or non-toxic, and filtering out drug-like compounds from large compound libraries.

This paper presents a sub-structure-based classification algorithm that decouples the sub-structure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric sub-structures present in the dataset. The advantage of this approach is that during classification model construction, all relevant sub-structures are available, allowing the classifier to intelligently select the most discriminating ones. Computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7% to 35%.
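The decoupled pipeline described above (discover frequent sub-structures first, then select features, then classify) can be illustrated with a toy sketch. This is not the paper's implementation: single labeled bonds stand in for the sub-structures that a real frequent subgraph miner would enumerate, the class-support gap is an assumed selection score, and a simple linear vote stands in for a trained classifier. The helper names (`bond`, `frequent_features`, `select_discriminating`, `classify`) and the example compounds are hypothetical.

```python
from collections import Counter

# Toy compounds: each one is a set of labeled bonds (atom, bond_type, atom).
# Assumption: single bonds stand in for sub-structures; a real miner would
# enumerate arbitrary connected subgraphs instead.
def bond(a, b, kind="-"):
    # Canonical order so that C-O and O-C map to the same feature.
    x, y = sorted((a, b))
    return (x, kind, y)

def frequent_features(compounds, min_support):
    """Step 1: keep every feature present in at least min_support of the compounds."""
    counts = Counter(f for c in compounds for f in set(c))
    threshold = min_support * len(compounds)
    return sorted(f for f, n in counts.items() if n >= threshold)

def select_discriminating(features, compounds, labels, top_k):
    """Step 2: rank features by the gap in support between the two classes."""
    pos = [c for c, y in zip(compounds, labels) if y == 1]
    neg = [c for c, y in zip(compounds, labels) if y == 0]

    def gap(f):
        p = sum(f in c for c in pos) / len(pos)
        n = sum(f in c for c in neg) / len(neg)
        return p - n

    ranked = sorted(features, key=lambda f: abs(gap(f)), reverse=True)
    return {f: gap(f) for f in ranked[:top_k]}

def classify(compound, weights):
    """Step 3: a deliberately simple linear vote over the selected features."""
    score = sum(w for f, w in weights.items() if f in compound)
    return 1 if score > 0 else 0

# Hypothetical training data: label 1 = active, label 0 = inactive.
train = [
    {bond("C", "C"), bond("C", "N"), bond("N", "O", "=")},
    {bond("C", "C"), bond("N", "O", "="), bond("C", "O")},
    {bond("C", "C"), bond("C", "O")},
    {bond("C", "C"), bond("C", "Cl")},
]
labels = [1, 1, 0, 0]

features = frequent_features(train, min_support=0.5)
weights = select_discriminating(features, train, labels, top_k=2)
prediction = classify({bond("C", "C"), bond("N", "O", "=")}, weights)
print(features, weights, prediction)
```

Because feature discovery runs once and independently of the labels, the selection and classification steps can be swapped or retuned without re-mining the dataset, which is the practical benefit of the decoupling.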

This is an expanded version of the ICDM03 paper.
Research topics: Bioinformatics | Cheminformatics | Classification | Data mining