2.4 Tutorial 1: Performing the Basic Workflow in GUI Mode

To follow this tutorial, first start ChemTree. You can do this in the usual way for your operating system, such as double-clicking the program’s icon or entering the program name on the command line of a console window. Either way, ChemTree launches in GUI mode. Once ChemTree is running, we can create a new project.

2.4.1 Create a New Project

Projects, in ChemTree, are the work area in which all data analysis takes place. (See section 3.1.2 for a complete description of creating a Project.) They provide a number of benefits. First, since they are named, they allow you to keep separate data analysis activities distinct from one another. Second, projects can be closed and later reopened without losing any of your work. Finally, each project automatically maintains a detailed history of all of the operations performed within it. This is important if you are ever asked to identify the steps that led to a particular analysis result.

New projects are created via the New Project dialog box. This dialog can be launched by simply selecting the File->New Project menu item or, alternatively, by clicking on the Create A New Project button on the main window button bar.


[Picture]
Figure 2.1: The New Project dialog window for entering the name of the new project

As you can see, the dialog requires that you provide a name for your project. We will call our project ‘Discovery’. Enter that name in the Project Name text field of the New Project dialog box.

The dialog box also allows you to specify the directory location of your project. Just click on the Browse button and you will see the following window.


[Picture]
Figure 2.2: The Browse For Folder window for selecting a parent folder for the project

For this tutorial, we will use the example subdirectory of the ChemTree installation directory as our work area. In the Browse For Folder window, select that subdirectory and click the OK button.

When you have completed entering all of the necessary information, click on the Create Project button to close the dialog and create the project. Notice that the client area of the ChemTree main window has changed to show the Project Viewer.


[Picture]
Figure 2.3: The ChemTree main navigator window showing the Project Node

The Project Viewer has three panes. For the moment, we will focus on the leftmost pane, the Project Navigator pane, which contains an icon for the project that you just created.

Since our new project is empty, there is not much that we can do that is very interesting. To change that situation, we will import some data.

2.4.2 Import a Data Set

Importing data into a ChemTree project is generally straightforward. When using your own data sets, make sure that the data types in your data set conform to what ChemTree allows. The details are described in Chapter 4, Importing Your Data Into ChemTree.

There are several different ways to import data depending on the format of the original data set. For this tutorial, we will use the File->Import Data->Import HTS file menu item. There are a number of buttons and fields displayed in the HTS Import dialog box (see below).


[Picture]
Figure 2.4: The dialog window for importing compound data.

NOTE: If you have not purchased the multivariate feature of ChemTree, some of these fields will not show. However, the default choices are the same for both versions of this window.

We are going to import compound data from an SD file called external.SD found in the example subdirectory of the ChemTree installation directory. Simultaneously, we will import potency data for these compounds from a text file called external.dat.

Click the Choose button, which opens a file chooser dialog. Navigate to the example subdirectory of the ChemTree installation directory, where you will find the file external.SD. Click the file name and then click the Open button, which closes the file chooser dialog.

You are now back at the Import SD file dialog window, where the Input(MDL) SD file field has been filled in with the file you selected. By default, the Atom path lengths checkbox is checked, the Integer Low and High radio button is selected, and the Path length descriptor field is set to five. For this example, we will accept all of these defaults.

Next, check the Descriptors/Potencies from file below checkbox, which activates the final two text fields of the dialog. Click the Choose button to open a file browser and navigate to the example subdirectory of the ChemTree installation directory, where you will find the file external.dat. Click the file name and then click the Open button, which closes the file chooser dialog. You are returned to the first dialog box, where the last two text fields are now filled in with the selected file name and "Space delimited", respectively.


[Picture]
Figure 2.5: The completed dialog window (non-multivariate version).

Now click the OK button. An additional dialog appears in which you select the correct compound name field within the SD file (see Fig. 2.6). This name field will later be used to identify the compounds. In addition, if a descriptor/potency file is being used, the names in this field must match the compound names in that file for every compound that is to be imported.


[Picture]
Figure 2.6: Dialog for selecting a label.

NOTE: It is not necessary for all compounds to match between the two files in order to perform an import. Also, it is not necessary to have the matching compounds sorted in the same order between the two files. Neither is it necessary for every compound in the SD file to be readable. Only those compounds which are readable in the SD file and match between the two files will be imported. Thus, a slight corruption of data can be tolerated during the import process.
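The matching rule described in this note can be pictured with a short Python sketch. It is an illustration only, not ChemTree's import code: the function name match_compounds, the use of None to stand for an unreadable SD record, the choice to keep the SD file order, and the sample names and potencies are all assumptions made for the example.

```python
def match_compounds(sd_names, potency_by_name):
    """Keep only compounds that are readable and present in both inputs.

    sd_names: compound names in SD file order; None marks an unreadable
              record (a hypothetical convention for this sketch).
    potency_by_name: mapping built from the descriptor/potency file.
    """
    matched = []
    for name in sd_names:
        if name is None:
            continue  # unreadable SD record: skipped, import continues
        if name in potency_by_name:
            matched.append((name, potency_by_name[name]))
        # names found in only one of the two files are silently dropped
    return matched


# Made-up data: the two files list the compounds in different orders.
sd_names = ["C3", "C1", None, "C2", "C9"]
potency_by_name = {"C1": 1, "C2": 0, "C3": 1, "C7": 0}

imported = match_compounds(sd_names, potency_by_name)
print(imported)
```

Here C9 (no potency) and C7 (no structure) are left out, and the unreadable record is skipped, so three compounds are imported.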

The Header Name field is highlighted by default. Click the OK button to accept the highlighted field. ChemTree will now begin the import process. Progress bars will appear as each step of the import runs. When the import is complete, an information dialog reports that 2018 compounds were imported (see Fig. 2.7). Finish by clicking the OK button.


[Picture]
Figure 2.7: Dialog showing import total.

At this point, all the open dialog windows will be closed and a spreadsheet viewer will open showing the imported data. The imported data and the spreadsheet will also be represented by two new icons in the Project Navigator window (see Fig. 2.8). The spreadsheet will be called Sheet 1 by default; you may want to give it a more descriptive name.


[Picture]
Figure 2.8: Project Navigator Window after import.

2.4.3 View Imported Data in a Spread Sheet Viewer

The Spread Sheet Viewer window has its own menu and button bar; otherwise it looks like an ordinary spreadsheet.


[Picture]
Figure 2.9: The Spread Sheet Viewer

The leftmost column contains the row number. It is followed immediately by a column for the compound name, labeled ComID in this example. By scrolling down the spreadsheet, you can verify that there are 2018 compounds in this data set.

NOTE: If you right-click on any compound name, a view of that compound will appear in a separate window.

You should see the potency data in the first data column. By scrolling horizontally, you should be able to verify that the potency column is followed by over a thousand columns, made up of more than 500 pairs of columns labeled “PLLO:...” and “PLHI:...”, which stand for “path-length low” and “path-length high”, respectively. These columns, which relate to the structural features of the compounds, were generated by ChemTree during the import process from the compound descriptions contained in the SD file. They give the lowest and the highest possible path length between the two augmented atoms named in the column header after the “PLLO:” or “PLHI:” prefix. If either of the augmented atoms is not in the compound, the corresponding spreadsheet cell shows the “missing data” symbol, a ‘?’. (Here, “missing data” means that one or both of the atoms in the augmented atom pair were not present in that particular compound.)
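To make the idea of a low and a high path length concrete, the following Python sketch finds the shortest and the longest simple path between two atoms in a small molecular graph. It is not ChemTree's descriptor algorithm: the function path_length_range, the adjacency-list representation, the treatment of an atom pair as two specific atom indices (rather than augmented atom types), and the six-membered ring example are all assumptions made for illustration.

```python
def path_length_range(bonds, start, end):
    """Return (shortest, longest) simple-path length, counted in bonds,
    between two atoms, or None if either atom is absent or unreachable.

    bonds: dict mapping each atom index to a list of bonded atom indices.
    """
    if start not in bonds or end not in bonds:
        return None  # would be shown as the missing-data symbol '?'

    lengths = []

    def walk(atom, visited, depth):
        # Depth-first enumeration of simple paths (no atom visited twice).
        if atom == end:
            lengths.append(depth)
            return
        for neighbor in bonds[atom]:
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor}, depth + 1)

    walk(start, {start}, 0)
    return (min(lengths), max(lengths)) if lengths else None


# A six-membered ring: atom i is bonded to its two ring neighbors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

low_high = path_length_range(ring, 0, 2)   # short way round vs. long way round
missing = path_length_range(ring, 0, 7)    # atom 7 is not in this compound
print(low_high, missing)
```

In the ring, atoms 0 and 2 are two bonds apart going one way and four bonds apart going the other, which is the kind of low and high pair that a PLLO and PLHI column pair would hold.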

NOTE: If you have generated your own compound descriptors, it is possible to import them into ChemTree by including them with the potencies (and other fields) in the “Descriptor/Potency” file. However, in this example, we will use only those descriptors generated by ChemTree which are described above.

2.4.4 Identify the Data Columns to be Used in Analysis

Our goal in this tutorial is to establish whether a useful prediction model can be created from this dataset. For purposes of analysis, we need to be able to enable, disable, and show dependency relationships between columns. All of this can be achieved by clicking on the column label and noting the resultant color coding of the data items. Initially, all the columns of a newly imported data set are enabled so the data should be displayed as black text on a white background. Data columns whose display color is black are considered independent variables.

Clicking a column label once changes the display color of its data from black to magenta, which indicates that the column is now a dependent variable. A second click changes the color from magenta to gray, which identifies a disabled column. A third click returns the color to black, identifying the column once again as an enabled, independent variable. Columns of data can also be sorted by clicking on the column number bar just above the column label; each successive click reverses the sort order. For our tutorial, make the potency column (labeled Y) dependent by clicking its label once. Leave all the other columns as enabled, independent variables.


[Picture]
Figure 2.10: Spread Sheet Viewer showing the dependent column.
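The three-state click cycle for column labels can be summarized as a tiny state machine. The Python sketch below only models the behavior described in this section; the names ColumnState and click are hypothetical and do not correspond to anything in ChemTree itself.

```python
from enum import Enum


class ColumnState(Enum):
    """Column roles, with the display color described in the text."""
    INDEPENDENT = "black"    # enabled, independent variable (initial state)
    DEPENDENT = "magenta"    # dependent variable
    DISABLED = "gray"        # excluded from analysis


# Order in which successive clicks on a column label move through the states.
_CYCLE = [ColumnState.INDEPENDENT, ColumnState.DEPENDENT, ColumnState.DISABLED]


def click(state):
    """Return the state a column moves to after one click on its label."""
    return _CYCLE[(_CYCLE.index(state) + 1) % len(_CYCLE)]


state = ColumnState.INDEPENDENT
for _ in range(3):
    state = click(state)
    print(state.name, state.value)
```

One click on the Y column therefore leaves it in the DEPENDENT (magenta) state, and three clicks bring any column back to where it started.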

2.4.5 Perform the Analysis

All analysis is performed in the context of the Spread Sheet Viewer. For the Discovery project, our first analysis effort will be very simple. We begin by randomly selecting half of the rows. Later we will apply the model built from this half to the other half to determine whether this data set can be used to make predictions.

From the Spread Sheet Viewer menu, select Edit->Select Row Subset. A selection dialog appears with the Random fraction: option chosen by default and 0.5 as the default fraction.


[Picture]
Figure 2.11: The dialog window showing the subsets that can be selected

Click OK. A random half of the rows is now selected, while the other half is deselected (grayed out).


[Picture]
Figure 2.12: Spreadsheet view of randomly deactivated rows
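The effect of a random-fraction selection on a 2018-row data set can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ChemTree's implementation: the function select_random_fraction and its seed parameter are hypothetical, and we assume the selection picks an exact fraction of the rows rather than deciding each row independently.

```python
import random


def select_random_fraction(n_rows, fraction, seed=None):
    """Return the set of selected row indices for a random-fraction subset."""
    rng = random.Random(seed)          # seed only makes the sketch repeatable
    k = round(n_rows * fraction)       # number of rows to keep
    return set(rng.sample(range(n_rows), k))


selected = select_random_fraction(2018, 0.5, seed=0)

# The deselected (grayed-out) rows are simply the complement of the selection.
held_out = set(range(2018)) - selected

print(len(selected), len(held_out))
```

The two sets never overlap and together cover every row, which is what makes the grayed-out half usable later as an independent check.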

From the Spread Sheet Viewer menu, select the Analysis->Interactive Tree Analysis menu item. A new window called Tree View 1 is now displayed.


[Picture]
Figure 2.13: Tree View 1

In the client area of this window is a box which represents (and which we will call) the root node. The root node has several variables displayed in it:

  1. Y, potency, the dependent variable of our data set.
  2. n, the number of elements of the data set that the node represents.
  3. μ, the mean potency of the data set. NOTE: Since this is a binary variable (that is, it can only take on values of zero or one), the “mean” is the fraction of the data whose value is one rather than zero.
  4. s, the standard deviation.
  5. N (followed by zero or more digits) is a unique identifier for each node. The digits represent branch position in the tree as viewed from left to right.
  6. I, in the lower right corner of the node, which displays a Node Annotator dialog when clicked.
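The n, μ, and s values in a node can be reproduced by hand for a small binary potency column. The Python sketch below uses made-up values and assumes that s is the sample standard deviation; the text does not say which form of the standard deviation ChemTree reports.

```python
from statistics import mean, stdev

# Made-up binary potencies (1 = active, 0 = inactive) for one node.
values = [1, 0, 0, 1, 0, 0, 0, 1]

n = len(values)      # number of observations in the node
mu = mean(values)    # for 0/1 data, this is the fraction of ones
s = stdev(values)    # sample standard deviation (an assumption, see above)

print(n, mu, round(s, 4))
```

Three of the eight values are one, so μ is 3/8 = 0.375, which is exactly the fraction of active compounds in the node.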

Notice that our tree node contains only 1009 observations after selecting a random 50 percent from the spreadsheet. We now use this tree viewer to create a forest of random trees for making predictions by selecting the Tree->Extend Current Tree Randomly menu item. This brings up a dialog of options for generating trees. See Chapter 9 on Random Tree Generation for complete details on this process.


[Picture]
Figure 2.14: The Random Tree Creation dialog that defines the scope of the tree.

Experience has shown that relaxing the P value threshold and setting the max segments to 3, which over-builds the random trees, yields a better prediction model. See section 8.2 for more details.

So, set the max segments to 3 and the P value threshold to 0.99, and accept all the other default settings. Click GO to continue. Progress bars appear, followed by a window confirming that 100 random trees were created. Click OK. A new icon, called a Multitree Model, is now added to the Project Navigator window and the Multitree Viewer is displayed (see Fig. 2.15).


[Picture]
Figure 2.15: Multitree Viewer for 100 random trees

2.4.6 Check the Predictions of the Multitree Model

We now apply the Multitree Model to the random 50 percent of the data that was previously grayed out. First return to the Spread Sheet Viewer and select the menu option Edit->Invert selection.


[Picture]
Figure 2.16: Notice that the row buttons on the left are inverted from Fig. 2.12

This creates a new Spread Sheet View, called Sheet 2, with the previously grayed-out rows now activated and the previously activated rows now grayed out. We are now ready to apply the Multitree Model created with the first random half of the data to the remaining half of the data. Using the new Sheet 2, select the Analysis->Apply a Tree Model menu choice. This brings up the Tree Model Chooser shown below. Possible tree models are highlighted in white.


[Picture]
Figure 2.17: Multitree Model Chooser

Notice that both single-tree and multiple-tree models can be used. Now click on the Multitree Model with the 100 random trees we previously created. It will change to a blue highlight. Click OK to continue. The highlighted tree model will now be applied to the highlighted rows of the current spreadsheet, creating an Applied Tree Model icon in the Project Navigator window and opening up the Applied Tree Model Viewer shown below.


[Picture]
Figure 2.18: Applied Tree Model Viewer

Looking at the Applied Tree Model Viewer and scrolling through its tree list, we see that the root mean square (RMS) error on the holdout sample, displayed in the “Pred. RMS Err.” column, ranges from 0.44 to 0.50. This error measures how far the potency predictions are from their actual values. For a tree with an RMS error of 0.45, the root-mean-square deviation of its predictions from the actual values is 0.45 units, so a typical prediction is off by roughly that amount. This column can be compared with the “Orig. RMS Err.” column, which shows the in-sample RMS error; it is significantly lower, as would be expected with an over-fitted model.
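For readers who want to see how an RMS error is computed, here is a minimal Python sketch using the standard definition. The predicted and actual values are made up, and the function names rms_error and mean_abs_error are our own; nothing here is taken from ChemTree's code.

```python
import math


def rms_error(predicted, actual):
    """Root mean square of the differences between predictions and actuals."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_error(predicted, actual):
    """Plain average of the absolute differences, shown for comparison."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up holdout predictions for four compounds with binary potency.
predicted = [0.9, 0.2, 0.6, 0.1]
actual = [1, 0, 0, 0]

rms = rms_error(predicted, actual)
mae = mean_abs_error(predicted, actual)
print(round(rms, 3), round(mae, 3))
```

Because the differences are squared before averaging, a single large miss (0.6 here) raises the RMS error more than it raises the plain average difference, so the RMS error is never smaller than the mean absolute error.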

To complete our tutorial, select the Predictions->View Average Tree Predictions menu choice from the Applied Tree Model Viewer. This creates a spreadsheet showing the predictions from the tree model versus the actual values in the spreadsheet. Click on the header of column number 4 to sort by the Predicted values. The average predictions spreadsheet should now look like Fig. 2.19.


[Picture]
Figure 2.19: Average Tree Predictions

Note that the dependent variable, potency, is binary; that is, it can only take on the values “Yes” or “No” (one or zero). Thus, we would not expect many of the averaged tree predictions for a compound’s potency to be exactly 1.00 or 0.00. However, spot-checking the average tree predictions spreadsheet confirms that many compounds whose actual value is one are predicted to have a high value, such as 0.98 or 0.99, and many compounds whose actual value is zero are predicted to have a low value, such as 0.15 or 0.11.

Scrolling through the spreadsheet, now that it has been sorted by predicted values, shows that almost all compounds with low predictions, such as less than 0.11, really do have a “No” (zero) value for their potency. Similarly, almost all compounds with high predicted values, such as more than 0.76, really do have a “Yes” (one) value for their potency. In between, the proportion of “Yes” to “No” compounds changes gradually. This suggests that the model has substantial predictive power, at least for the other half of the same dataset.
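The spot check described above can also be expressed as a small calculation: average each compound's prediction over all trees, sort by that average, and look at the fraction of actual “Yes” values within a band of predictions. The Python sketch below assumes that an average tree prediction is a simple mean over the trees, which the text does not state explicitly; the function names and the data for three trees and four compounds are made up.

```python
def average_predictions(per_tree_predictions):
    """Average each compound's prediction over all trees.

    per_tree_predictions: one list per tree, all in the same compound order.
    """
    n_trees = len(per_tree_predictions)
    return [sum(column) / n_trees for column in zip(*per_tree_predictions)]


def fraction_active(rows, low, high):
    """Fraction of actual 'Yes' (1) among rows with low <= predicted < high."""
    hits = [actual for predicted, actual in rows if low <= predicted < high]
    return sum(hits) / len(hits) if hits else None


# Made-up predictions from three trees for four compounds.
per_tree = [
    [1.0, 0.0, 0.5, 0.0],
    [1.0, 0.0, 1.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
]
actual = [1, 0, 1, 0]

averaged = average_predictions(per_tree)
rows = sorted(zip(averaged, actual))   # sorted by predicted value

print(rows)
print(fraction_active(rows, 0.0, 0.2), fraction_active(rows, 0.75, 1.01))
```

With real data, a model with predictive power shows a fraction near zero in the lowest band and near one in the highest band, with intermediate fractions in between.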