AFGen - Chemical Compound Descriptors based on Acyclic Fragments

The AFGen Program

AFGen is a program that takes as input a set of chemical compounds and generates their vector-space representation based on the set of fragment-based descriptors they contain. These fragments are bounded-length connected acyclic subgraphs, which include both trees and paths. This vector-based representation can be used for different tasks in cheminformatics, including similarity search, virtual screening, and library design.
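As an illustration of how such a vector-space representation might be used for similarity search, the sketch below compares two compounds stored as sparse vectors that map fragment identifiers to occurrence counts. The Tanimoto coefficient is a common choice in cheminformatics, but it is an assumption here and not something AFGen computes; the function name and the sample vectors are hypothetical.

```python
def tanimoto(a, b):
    """Extended Tanimoto coefficient between two sparse count vectors.

    Each vector is a dict mapping fragment identifier -> occurrence count.
    This is an illustrative similarity measure, not part of AFGen.
    """
    # Dot product over the fragments the two compounds share.
    dot = sum(count * b[frag] for frag, count in a.items() if frag in b)
    norm_a = sum(count * count for count in a.values())
    norm_b = sum(count * count for count in b.values())
    denom = norm_a + norm_b - dot
    # Two empty vectors have no fragments to compare; report 0.0.
    return dot / denom if denom else 0.0


# Hypothetical descriptor vectors for two compounds.
compound_a = {1: 1, 2: 2}
compound_b = {2: 2, 3: 1}
print(tanimoto(compound_a, compound_b))
```

Identical vectors score 1.0, and vectors with no fragments in common score 0.0.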

Installation

AFGen is currently distributed in binary format, with executables for Linux and MS Windows. AFGen can be obtained from here, and the distribution consists of a single archive file in either Unix tar.gz or Windows zip format.

The following files constitute AFGen's distribution:

AFGen This is the Linux binary.
AFGenWin.exe This is the MS Windows binary. It is a console program and should be executed from within a command-line window (e.g., cmd in Windows XP).
manual.html This is AFGen's documentation (i.e., this file).
sample.sdf This is a sample input file containing a set of compounds.
afpaper.pdf A paper containing a comprehensive experimental evaluation of the descriptors generated by AFGen.
VERSION This file contains the version number of the distribution.
LICENSE This file contains the copyright notice and license information.

Command Line Interface

Usage

AFGen [options] <input file>

Input

AFGen's only required parameter is a file that stores the input compounds. AFGen supports the SDF and Mol2 file formats, and it selects the format based on the extension of the input file. If the file has an ".sdf" extension, AFGen expects the compounds to be specified in SDF format; if the extension is ".mol2", AFGen assumes that the file is in Mol2 format. For more information on the SDF file format visit MDL, and for the Mol2 format visit TRIPOS.

Note that AFGen accepts only a single input file, which must store all the compounds to be analyzed.

Output

The output consists of two files: fragfile and descrfile. The fragfile contains the fragments that were generated by AFGen, whereas the descrfile contains the fragment-based representation of each input compound (i.e., the descriptor representation).

The fragments are stored in the same format as the input compounds (i.e., sdf or mol2). The name of the fragfile is derived from the name of the input file by appending "_frags.sdf" or "_frags.mol2" to the input filename's stem.

Example 1:
If the input filename is "mycompounds.sdf", the name of the fragfile is "mycompounds_frags.sdf".

For labeling purposes, AFGen assigns to each of these fragments an identifier from 1 to N, where N is the total number of unique fragments that were generated.

The descriptor-based representation of the compounds is stored in the descrfile. The name of the descrfile is derived from the name of the input file by appending ".out" to the input filename's stem.

Example 2:
If the input filename is "mycompounds.sdf", the name of the descrfile is "mycompounds.out".

The descrfile contains one line per compound, and the ith line stores the descriptor-based representation of the ith input compound. The descriptor-based representation of each compound is a comma-separated list whose first entry is the compound's identifier (as specified in the input file), followed by a list of (fragment-identifier, occurrence-frequency) pairs.

Example 3:

"Benzene",2,1,10,1,58,2,64,1 ... 
"Folic Acid",10,1,50,4 ... 
...
...

In this example, the compound "Benzene" contains fragments 2, 10, and 58 (among others), whose frequencies are 1, 1, and 2, respectively.

The fragment identifiers correspond to the numerical identifiers assigned to these fragments in fragfile (i.e., from 1 to N). Note that the occurrence frequency is nothing more than the number of times each fragment occurs in the compound. Two occurrences are considered different if they have at least one different edge.
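The descrfile format described above can be read with a few lines of Python. The following is a minimal sketch that assumes compound identifiers are double-quoted, as in Example 3, and that the fields after the identifier always come in (fragment-identifier, occurrence-frequency) pairs; the function name is hypothetical and not part of AFGen.

```python
import csv


def parse_descr_line(line):
    """Parse one descrfile line into (compound_id, {fragment_id: frequency}).

    Assumes the layout shown in Example 3: a quoted identifier followed by
    alternating fragment identifiers and occurrence frequencies.
    """
    # csv handles the quoted identifier, including names with spaces.
    fields = next(csv.reader([line.strip()]))
    compound_id, values = fields[0], fields[1:]
    if len(values) % 2 != 0:
        raise ValueError("expected (fragment, frequency) pairs after the identifier")
    fragments = {}
    for i in range(0, len(values), 2):
        fragments[int(values[i])] = int(values[i + 1])
    return compound_id, fragments


name, frags = parse_descr_line('"Benzene",2,1,10,1,58,2,64,1')
print(name, frags)  # Benzene {2: 1, 10: 1, 58: 2, 64: 1}
```

Applying the function to every line of the descrfile yields one sparse vector per compound, keyed by the fragment identifiers used in the fragfile.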

Options

-ds {AF,TF,PF}

Specifies the type of fragments to be generated. The possible values are:
AF     Acyclic Fragments (default)
TF     Tree Fragments (only acyclic fragments consisting of trees)
PF     Path Fragments (only acyclic fragments consisting of paths)

-lmin [1...]

Specifies the minimum number of bonds (i.e., length) of the generated fragments. The default value is one.

-lmax [1...]

Specifies the maximum number of bonds of the generated fragments. Note that lmax must be greater than or equal to lmin. The default value is seven.

-fmin [1...]

Specifies the minimum frequency that a fragment must have before it becomes a descriptor. The frequency of a fragment is the number of distinct compounds in which it occurs. The default value is one (i.e., all fragments are treated as descriptors).
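To make the -fmin criterion concrete, the sketch below counts, for each fragment, the number of distinct compounds that contain it, and keeps only the fragments that reach the threshold. It illustrates the definition only and is not AFGen's implementation; the function names and the sample data are hypothetical.

```python
from collections import Counter


def fragment_support(compounds):
    """Count how many distinct compounds contain each fragment.

    `compounds` maps a compound identifier to a dict of
    fragment identifier -> occurrence count within that compound.
    """
    support = Counter()
    for fragments in compounds.values():
        # Each compound contributes at most 1 per fragment, regardless of
        # how many times the fragment occurs inside it.
        support.update(fragments.keys())
    return support


def frequent_fragments(compounds, fmin=1):
    """Return the fragment identifiers that occur in at least fmin compounds."""
    support = fragment_support(compounds)
    return {frag for frag, count in support.items() if count >= fmin}


# Hypothetical descriptor data for three compounds.
compounds = {
    "A": {1: 3, 2: 1},
    "B": {1: 1, 3: 2},
    "C": {1: 1, 2: 4},
}
print(sorted(frequent_fragments(compounds, fmin=2)))  # [1, 2]
```

Fragment 3 occurs twice in compound B but in no other compound, so its frequency in the -fmin sense is one and it is dropped when fmin is 2.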

-NoAtmLabels

This option forces AFGen to ignore the fine atom typing specified in the input file (if any). If this option is used, then only the basic atom types are used (e.g., P, N, O, etc.). This option applies only to input files that use the Mol2 format, as the SDF format does not support fine atom typing. By default, AFGen uses the supplied atom typing.

-NoBndLabels

This option forces AFGen to ignore the bond typing specified in the input file (if any). If this option is used, then all bonds are treated as belonging to the same type. By default AFGen uses the supplied bond typing.

-ofile <outfstem>

Specifies the stem of the output files. The descriptor file will be named outfstem.out, and the fragment file will be named outfstem_frags.sdf or outfstem_frags.mol2, depending on the input format. If the output stem is not specified, the stem of the input file is used.
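The naming rules can be summarized in a small helper that predicts the output file names for a given input file and optional -ofile stem. This is a sketch under stated assumptions: it uses the "_frags" suffix described in the Output section, assumes that the default stem keeps the directory of the input file, and accepts the extension case-insensitively. None of these details is confirmed beyond what the manual states, and the function name is hypothetical.

```python
from pathlib import Path


def expected_output_files(input_file, ofile=None):
    """Return (descrfile, fragfile) names AFGen is expected to write.

    Assumption: with no -ofile stem, the input path minus its extension
    is used as the stem.
    """
    path = Path(input_file)
    ext = path.suffix
    if ext.lower() not in (".sdf", ".mol2"):
        raise ValueError("AFGen expects an input file ending in .sdf or .mol2")
    stem = ofile if ofile is not None else str(path.with_suffix(""))
    # The fragment file keeps the format (extension) of the input file.
    return stem + ".out", stem + "_frags" + ext


print(expected_output_files("mycompounds.sdf"))
print(expected_output_files("sample.sdf", ofile="output"))
```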

Examples

> AFGen -ds AF -lmin 2 -lmax 6 -fmin 1 -ofile output sample.sdf

Generates all Acyclic Fragments containing between 2 and 6 bonds that occur in at least one compound. The generated fragments will be stored in the file output_frags.sdf and the fragment-based representation of each compound will be stored in output.out.
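When AFGen is driven from a script, it can help to assemble and validate the command line before launching the program. The following sketch builds the argument list from the options documented above and checks the stated constraints (valid -ds value, lmax greater than or equal to lmin). The function name is hypothetical, and the executable name "AFGen" assumes the Linux binary is on the search path.

```python
def build_afgen_command(input_file, ds="AF", lmin=1, lmax=7, fmin=1,
                        ofile=None, executable="AFGen"):
    """Build an AFGen argument list from the documented options."""
    if ds not in ("AF", "TF", "PF"):
        raise ValueError("ds must be one of AF, TF, PF")
    if lmin < 1 or lmax < lmin:
        raise ValueError("require 1 <= lmin <= lmax")
    if fmin < 1:
        raise ValueError("fmin must be at least 1")
    cmd = [executable, "-ds", ds, "-lmin", str(lmin),
           "-lmax", str(lmax), "-fmin", str(fmin)]
    if ofile is not None:
        cmd += ["-ofile", ofile]
    # The input file is the last argument, as in the Usage line.
    cmd.append(input_file)
    return cmd


cmd = build_afgen_command("sample.sdf", lmin=2, lmax=6, ofile="output")
print(" ".join(cmd))
# To run it (requires the AFGen binary to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

The printed command matches the example invocation shown above.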

Compound Pre-processing with OpenBabel

Quite often, chemical compounds are pre-processed prior to descriptor generation. Examples of such pre-processing steps (often referred to as structure normalization) are the removal of all hydrogen atoms and/or identification of aromatic bonds. AFGen does not provide any mechanisms by which to perform such normalization operations. However, if desired, the open source package OpenBabel can be used to pre-process the input files prior to using AFGen. Information on how to obtain and install OpenBabel is available at http://openbabel.sourceforge.net.

Once installed, the OpenBabel package can be used to perform such normalizations as follows:

Example 4: Delete All Hydrogen Atoms.

>babel -d -isdf sample.sdf -osdf sample-noH.sdf

This will remove all hydrogen atoms from the input file sample.sdf and write the new compounds in the sample-noH.sdf file.

Example 5: Detect and Label the Aromatic Bonds.

>babel -isdf sample.sdf -omol2 sample.mol2

This will detect the aromatic bonds that are present in the compounds of the input file sample.sdf and write them into the file sample.mol2. Note that the aromatic bond typing is a standard feature of the Mol2 file and as such, the aromatic bond detection is actually done as a result of converting the input SDF file into a Mol2 file.

Example 6: Delete All Hydrogen Atoms and Detect and Label the Aromatic Bonds.

>babel -d -isdf sample.sdf -omol2 sample.mol2

This essentially combines the operations performed in the previous two examples.

Contact Information

If you have any questions or problems with AFGen please send an email to karypis@cs.umn.edu.

Citing AFGen

The included afpaper.pdf provides a detailed experimental evaluation of the descriptors generated by AFGen and compares their performance against that achieved by other widely used descriptors.

In citing AFGen in your papers, please use the following reference:

"Acyclic Subgraph-based Descriptor Spaces for Chemical Compound Retrieval and Classification" Nikil Wale and George Karypis. UMN CSE Technical Report #06-008, 2006.

Copyright and License Information

AFGen is primarily written by Nikil Wale and is copyrighted by the Regents of the University of Minnesota. It can be freely used for educational and research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to use AFGen only for evaluation purposes, and any further uses will require prior approval.

The software may not be sold or redistributed without prior approval. One may make copies of the software for their own use, provided that the copies are not sold or distributed and are used under the same terms and conditions.

As unestablished research software, this code is provided on an "as is" basis without warranty of any kind, either expressed or implied. Downloading or executing any part of this software constitutes an implicit agreement to these terms. These terms and conditions are subject to change at any time without prior notice.
