4.3 Importing Data
The File->Import Data submenu presents two menu items for importing different types of data: Import legacy GHD File and Import HTS File.
These methods for importing data are described in the following sections.
4.3.1 Importing Legacy GHD Files
Users of ChemTree prior to release 4.0 may have data files already in the GHD format. In fact, some of the files included in the examples directory are stored in this legacy format. To remain usable, these legacy files must be imported into a project by selecting the File->Import Data->Import legacy GHD File menu item. An Open dialog is displayed, which allows you to navigate to the folder where the GHD file is located. Selecting the GHD file and clicking the Open button imports the file, closes the Open dialog, and returns control to the ChemTree main window. The imported GHD file is automatically added to the current project and displayed in a Spread Sheet Viewer.
4.3.2 Importing HTS Files
The MDL SD file format stores multiple compounds, one after another. Each compound has a name and is described in terms of atomic coordinates and connectivity information. Each compound may also carry compound descriptors and/or activity information after the molecular connectivity information; ChemTree can optionally import this information as well.
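To make the record layout concrete, the following is a minimal Python sketch of reading an SD file's text. It is not ChemTree's importer. It assumes the usual SD conventions: each record ends with a `$$$$` line, the first line of a record holds the compound name, and each data item is introduced by a `> <FIELD>` header followed by value lines and a blank line. The function names and the sample data are hypothetical.

```python
import re

# Data item header, e.g. "> <Potency>" (assumed SD convention).
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")

def _parse_record(lines):
    """Return (name, fields) for the lines of one SD record."""
    name = lines[0].strip()  # first header line carries the compound name
    fields = {}
    i = 0
    while i < len(lines):
        match = FIELD_HEADER.match(lines[i])
        i += 1
        if not match:
            continue
        value = []
        # Value lines run until the next blank line.
        while i < len(lines) and lines[i].strip():
            value.append(lines[i].strip())
            i += 1
        fields[match.group(1)] = "\n".join(value)
    return name, fields

def parse_sd_records(text):
    """Split SD-file text into records, one (name, fields) tuple per compound."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record delimiter
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # tolerate a missing final delimiter
        records.append(_parse_record(current))
    return records

# Hypothetical two-compound file with empty atom blocks, for illustration only.
sample = """100-00-5
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <Potency>
5.37

$$$$
100-01-6
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <Potency>
2.06

$$$$
"""

records = parse_sd_records(sample)
```

Here `records` holds one `(name, fields)` pair per compound, so the data fields that follow the connectivity block are available by field name.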
Additional molecular descriptors and/or potency results of a screen may be optionally imported simultaneously from an external file.
4.3.2.1 Atom Path Lengths
These descriptors, which ChemTree may generate automatically from the SD file when it is imported, are path lengths between atom types. We use augmented atoms to describe atom types. An augmented atom consists of a focal atom and the non-hydrogen atoms immediately bonded to that focal atom. For instance, an NO2 group within a compound, bonded to a carbon, could be described in terms of a focal nitrogen atom bonded to a carbon and two oxygens, denoted N(COO). Path lengths are enumerated in terms of the number of bonds between one focal atom and another. Observe the next figure:
[Figure: example compound illustrating path lengths between O(C) and C(CC) focal atoms]
There is both a binary and an integer form of this path length. The binary form specifies the presence or absence of a particular atom pair at a particular distance, and is only available for multivariate analysis. The integer form, used for univariate analysis, uses the distance between the atom pair as the value for the descriptor. Because focal atoms may not be unique within a compound, for the integer form we create two descriptors: the lowest and the highest shortest path length seen between the focal atoms. In the figure above, an oxygen atom bonded only to a carbon atom, O(C), is 8 bonds away from a carbon bonded to two other carbons, C(CC). Note, however, that there are two other O(C) focal atoms only 3 bonds away from the circled C(CC). So we create two descriptors for the atom pair, PLLO:O(C)-C(CC) and PLHI:O(C)-C(CC). For this compound the value of PLLO:O(C)-C(CC) is 3 and the value of PLHI:O(C)-C(CC) is 8. Other compounds may lack one or the other of these focal atoms, and hence have no path length. The PLLO and PLHI values for such a compound are missing, denoted by a question mark, “?”.
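The integer descriptors described above can be sketched in Python as follows. This is an illustrative sketch and not ChemTree's implementation. It assumes a compound is given as a list of element symbols plus a list of bonded index pairs, that hydrogens are ignored, and that the neighbors inside an augmented-atom label are written in alphabetical order (which agrees with N(COO) but is otherwise an assumption). All function names are hypothetical.

```python
from collections import deque

def build_adjacency(elements, bonds):
    """Adjacency sets over heavy (non-hydrogen) atoms only."""
    adjacency = {i: set() for i, el in enumerate(elements) if el != "H"}
    for a, b in bonds:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def augmented_label(atom, elements, adjacency):
    """Focal atom plus its heavy neighbors, e.g. 'N(COO)'."""
    neighbors = "".join(sorted(elements[n] for n in adjacency[atom]))
    return f"{elements[atom]}({neighbors})"

def bond_distances(start, adjacency):
    """Breadth-first search: shortest bond count from start to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adjacency[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist

def path_length_descriptors(elements, bonds, label_a, label_b):
    """Return (PLLO, PLHI) for one atom pair, or ('?', '?') if there is no path."""
    adjacency = build_adjacency(elements, bonds)
    labels = {atom: augmented_label(atom, elements, adjacency) for atom in adjacency}
    lengths = []
    for a in adjacency:
        if labels[a] != label_a:
            continue
        dist = bond_distances(a, adjacency)
        lengths += [d for b, d in dist.items() if b != a and labels[b] == label_b]
    if not lengths:
        return "?", "?"
    return min(lengths), max(lengths)

# Hypothetical example: the heavy-atom chain O-C-C-C-C-O.
elements = ["O", "C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
pllo, plhi = path_length_descriptors(elements, bonds, "O(C)", "C(CC)")
```

In this chain each O(C) is 2 bonds from the nearer C(CC) and 3 bonds from the farther one, so `pllo` is 2 and `plhi` is 3. Asking for a pair that does not occur, such as O(C) and N(COO), yields the missing value “?” for both.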
4.3.2.2 Augmented Atoms
These binary descriptors, which may also be generated by ChemTree, represent the presence or absence of a particular heavy atom with its immediately bonded neighbors. This descriptor is only available for multivariate analysis. It is best used in conjunction with the augmented atom pair descriptors, as single atom descriptors cannot describe topological distances.
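As an illustration of what presence/absence descriptors look like, the sketch below builds a binary table from augmented-atom labels that have already been derived for each compound. It is a sketch under stated assumptions, not ChemTree's output format. The function name, the compound names, and the column ordering are hypothetical.

```python
def binary_descriptor_table(labels_by_compound):
    """Return (columns, rows): one 0/1 flag per augmented atom per compound."""
    columns = sorted(
        {label for labels in labels_by_compound.values() for label in labels}
    )
    rows = {}
    for name, labels in labels_by_compound.items():
        present = set(labels)  # repeated occurrences still count as "present"
        rows[name] = [1 if column in present else 0 for column in columns]
    return columns, rows

# Hypothetical heavy-atom labels for two small compounds.
labels_by_compound = {
    "ethanol": ["C(C)", "C(CO)", "O(C)"],
    "propane": ["C(C)", "C(CC)", "C(C)"],
}
columns, rows = binary_descriptor_table(labels_by_compound)
```

Each row records only whether an augmented atom occurs in the compound, not how often or where, which is why such descriptors carry no information about topological distance.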
4.3.2.3 User-specified descriptors
This type of descriptor is for experts who want to create and compute their own descriptors for compounds. Any descriptor about a compound can be imported from a user-specified file, and various types of compound descriptors can be employed. Continuous measures, such as molecular weight or logP, can be imported. Splits on continuous predictors may be somewhat slower to compute than splits on other descriptor types, but continuous predictors can be very powerful predictors of activity. Integer measures, such as a count of the number of heavy atoms or a count of the number of occurrences of a fragment, can be used. Categorical descriptors, such as qualitative descriptions, are also possible. An example might be odor, where there is no natural ordering of the descriptions. The format for the user-specified file is described below.
4.3.2.4 User Descriptor/Potency file
A user descriptor file is used to import user-defined compound descriptors and/or activity measures. The file is comma- or space-delimited and contains a row of headers, followed by rows of data that correspond to the header. The first column must be the compound name. An example descriptor file follows:
Compound, Potency, -LogP, MolW,
100-00-5, 5.37, 7.6, 125,
100-01-6, 2.06, 5.1, 453,
100-02-7, 4.89, 4.3, 987,
100-03-8, 7.68, 8.9, 123,
100-14-1, 4.27, 2.7, 345,
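The file format above can be read with a few lines of Python. The following is a minimal sketch, not ChemTree's parser. It assumes that a trailing delimiter at the end of a line carries no data, and that a value which does not parse as a number is a categorical descriptor to be kept as text. The function names are hypothetical.

```python
import csv
import io

def _convert(cell):
    """Numbers become floats; anything else is kept as a categorical string."""
    try:
        return float(cell)
    except ValueError:
        return cell

def read_descriptor_file(text, delimiter=","):
    """Parse descriptor-file text into (header, {compound name: {column: value}})."""
    if delimiter == " ":
        rows = [line.split() for line in text.splitlines()]
    else:
        reader = csv.reader(io.StringIO(text), delimiter=delimiter)
        rows = [[cell.strip() for cell in row] for row in reader]
    # Drop the empty cell left by a trailing delimiter, then skip blank lines.
    rows = [row[:-1] if row and row[-1] == "" else row for row in rows]
    rows = [row for row in rows if row]
    header, data = rows[0], rows[1:]
    table = {}
    for row in data:
        # The first column must be the compound name.
        table[row[0]] = {col: _convert(val) for col, val in zip(header[1:], row[1:])}
    return header, table

example = """Compound, Potency, -LogP, MolW,
100-00-5, 5.37, 7.6, 125,
100-01-6, 2.06, 5.1, 453,
100-02-7, 4.89, 4.3, 987,
100-03-8, 7.68, 8.9, 123,
100-14-1, 4.27, 2.7, 345,
"""

header, table = read_descriptor_file(example)
```

After this call, `table["100-00-5"]["Potency"]` holds the potency of the first compound, and `header[0]` is the name of the compound column.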
4.3.2.5 Import Dialog
The HTS import dialog is responsible for the import of both the SD file and the supplemental descriptor/potency file, if there is one, into ChemTree. As it does this, it creates the requested (automatically generated) descriptors from the compound information.
In the opening screen, selecting File->Import Data->Import HTS File opens the Data Importation window. Pictured next are the two versions of this window that may open, depending on whether you have purchased the multivariate feature.
[Figures: the two versions of the Data Importation window, with and without the multivariate feature]
4.3.2.6 Input (MDL) SD File
The potency and/or molecular features data can be imported from the SD file or from a separate file, according to your preference. Click on Choose and select the SD file from your assay.
4.3.2.7 Augmented Atoms (Multivariate Only)
If you have purchased the multivariate feature as a part of ChemTree, you may choose to have binary augmented-atom descriptors generated for you from the compound information by checking this box.
4.3.2.8 Atom Path Lengths
If you want atom path lengths to be generated from the compound information, check this box. Also, make sure the minimum number of compounds in which a path length descriptor must appear is correct for your needs (default 5). A descriptor is included only if at least this many compounds give rise to it.
NOTE: For large data sets it may be desirable to set this threshold high so as to reduce the number of descriptors and increase the processing speed. For instance, for 30,000 compounds a threshold such as 100 will still give good prediction and modeling performance, yet speed up processing time dramatically.
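The effect of this threshold can be sketched as a simple filter. The code below is an illustration and not ChemTree's implementation. It assumes descriptors are held as one dictionary per compound and that a missing value is recorded as “?”. The function name, parameter name, and sample data are hypothetical.

```python
from collections import Counter

def filter_descriptors(descriptors_by_compound, min_compounds=5):
    """Keep only descriptors that occur (non-missing) in at least min_compounds compounds."""
    counts = Counter(
        name
        for descriptors in descriptors_by_compound.values()
        for name, value in descriptors.items()
        if value != "?"
    )
    kept = {name for name, count in counts.items() if count >= min_compounds}
    return {
        compound: {n: v for n, v in descriptors.items() if n in kept}
        for compound, descriptors in descriptors_by_compound.items()
    }

# Hypothetical data: three compounds and a threshold of 2.
data = {
    "cmpd-1": {"PLLO:O(C)-C(CC)": 3, "PLLO:N(COO)-C(CC)": 2},
    "cmpd-2": {"PLLO:O(C)-C(CC)": 2, "PLLO:N(COO)-C(CC)": "?"},
    "cmpd-3": {"PLLO:O(C)-C(CC)": "?"},
}
filtered = filter_descriptors(data, min_compounds=2)
```

With a threshold of 2, the O(C)-C(CC) descriptor survives because two compounds have a value for it, while the N(COO)-C(CC) descriptor is dropped because only one does. Raising the threshold shrinks the descriptor set in the same way.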
If you have purchased the multivariate feature as a part of ChemTree, select the desired radio button for whether you want integer or binary path length descriptors. If you select binary descriptors, also make sure the maximum path length for which you would want a separate descriptor to be made is correct for your needs (default 50).
4.3.2.9 Descriptor/Potency File
If you wish to simultaneously import a separate descriptor/potency file, check the Descriptors/Potencies from file below box, and fill in the file name or click on Choose to select the file. Also, be sure to select whether the descriptor/potency file is space- or comma-delimited.
See 4.3.2.4 for the format of this file.
NOTE: The compound identifiers are assumed to correspond to the compound name field you want to use in the SD file. (See 4.3.2.10.)
NOTE: Very early releases of ChemTree required a one-to-one mapping, in the same order, between the rows of the descriptor/potency file and the compound structures. This is no longer the case. The compounds can now be listed in any order, and ChemTree will match them up. If some compound names don’t match entries in the descriptor/potency file, a warning is displayed afterward reporting how many compounds couldn’t be imported.
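Order-independent matching of this kind amounts to a lookup by compound name. The following is a minimal sketch of the idea, not ChemTree's code. It assumes the compound names from the SD file are available as a list and the descriptor/potency rows as a dictionary keyed by compound name. The function name, sample data, and warning text are hypothetical.

```python
def match_descriptors(sd_names, descriptor_rows):
    """Pair SD compounds with descriptor rows by name, regardless of row order."""
    matched = {
        name: descriptor_rows[name] for name in sd_names if name in descriptor_rows
    }
    unmatched = [name for name in sd_names if name not in descriptor_rows]
    return matched, unmatched

# Hypothetical data: the descriptor rows are in a different order than the SD file,
# and one SD compound has no descriptor row.
sd_names = ["100-00-5", "100-01-6", "100-02-7"]
descriptor_rows = {
    "100-01-6": {"Potency": 2.06},
    "100-00-5": {"Potency": 5.37},
}
matched, unmatched = match_descriptors(sd_names, descriptor_rows)
if unmatched:
    print(f"Warning: {len(unmatched)} compound(s) could not be imported.")
```

Because the lookup is by name, the order of rows in either file does not matter. Only the count of compounds without a matching row needs to be reported.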
4.3.2.10 Additional Dialog(s)
After you click the OK button to start the importation, an additional dialog appears so you may select the correct compound name field within the SD file. The dialog gives possible fields to be selected, plus a sample value for each field derived from the first (valid) compound in the SD file. (See Fig. 4.6)
[Fig. 4.6: dialog for selecting the compound name field within the SD file]
The Header Name field is selected by default. If you are also importing a descriptor/potency file, remember to match the name field in that file when you select this field.
After you select the Header Name field, importation starts. After an initial scan of the SD file, if the file contains any fields not needed for the compound name, you are additionally asked to select, from a very similar-looking window, the descriptor fields you wish to import directly from the SD file.