12.1 Observation Distance Matrix Overview
The Observation Distance Matrix gives the distance between each observation in the data set and every other observation. The distance matrix is based upon the idea that when two observations end up together in a small subset deep within a tree, the descriptors that drive their response are shared and, hence, the observations are similar in model space.
Within a given tree, the distance between observations i and j is the number of observations in the deepest node of the tree that contains both observations, divided by the number of observations at the root of the tree. The overall distance between i and j is the average of the distances calculated for this pair over all the trees in a multiple tree model.
Consider an example where there are 1000 compounds in a tree, and compounds i and j are split apart into daughter nodes directly beneath the root node. In this case, the deepest node in which i and j appear together is the root node itself, so the distance between them is d(i,j)=1000/1000=1. Suppose instead that i and j end up together in a node near the bottom of the tree that contains 10 observations. Then the distance is d(i,j)=10/1000=0.01.
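The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation of a tree as a dictionary of root-to-leaf node paths, and the names tree_distance and overall_distance, are hypothetical and are not part of the software.

```python
from collections import Counter

def tree_distance(paths, i, j):
    """Distance between observations i and j within one tree.

    `paths` maps each observation to the list of node ids it passes
    through, root first (a hypothetical representation of a tree).
    """
    # The size of a node is the number of observations passing through it.
    sizes = Counter(node for path in paths.values() for node in path)
    root = paths[i][0]
    deepest_shared = root
    # Walk both paths from the root until they diverge.
    for a, b in zip(paths[i], paths[j]):
        if a != b:
            break
        deepest_shared = a
    return sizes[deepest_shared] / sizes[root]

def overall_distance(forest, i, j):
    """Average the per-tree distances over all trees in the model."""
    return sum(tree_distance(paths, i, j) for paths in forest) / len(forest)

# Two toy trees over four observations (made-up data).
tree_1 = {"a": ["R", "L", "La"], "b": ["R", "L", "Lb"],
          "c": ["R", "Rt"], "d": ["R", "Rt"]}
tree_2 = {"a": ["R", "X"], "b": ["R", "Y"],
          "c": ["R", "Y"], "d": ["R", "X"]}

print(tree_distance(tree_1, "a", "b"))               # shared node L holds 2 of 4
print(overall_distance([tree_1, tree_2], "a", "b"))  # mean of 0.5 and 1.0
```

In the first toy tree, a and b separate only below a node holding 2 of the 4 observations, so their distance is 0.5; in the second they separate at the root, giving 1.0, and the overall distance is the mean of the two.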
Note that this similarity metric contrasts with subgraph similarity metrics, in that the specific features that drive the response are the ones used to compute the distance between the compounds. It may be the case that two compounds are very different, but share a key functional group that drives activity, and they will come up as being similar, despite their overall graph dissimilarity.
12.1.1 Creating an Observation Distance Matrix
From the Navigator Window view, click on a project, or choose File->Open New Project and import the data to analyze. From the spreadsheet view, click Analysis->Create a Multiple Tree Model.
A Random Tree Creation window opens. Check or change any tree creation variables and click Go.
After the random trees are created, a Multitree Model window opens (Fig. 12.1) displaying a list of the random trees.
Figure 12.2 displays the menus leading to generating an Observation Distance Matrix. From the Observations->Plot Obs. Dist. Matrix menus, the following options are available:
Unsorted: The symmetric matrix of observations is displayed in the same order as the observations appear in the spreadsheet from which they came. The first observation is at the lower left, and the last observation is at the upper right.
Sorted by 1st Principal Component: The matrix is sorted in ascending order by the 1st principal component of the distance matrix. This is a simple clustering approach. The 1st principal component is the eigenvector corresponding to the largest eigenvalue of the distance matrix. Because calculating this first principal component takes O(n^3) operations, where n is the number of observations, you will not want to create this plot for more than a few thousand observations.
Sorted by Similarity to One Observation: This option is most useful for seeing which observations are most similar to a given observation. For instance, a chemist may wish to see the 100 compounds most similar to a given compound of interest. A dialog pops up listing all the observations, from which you can choose one. The distance matrix is then displayed with the chosen observation in the lower left and the k most similar observations ascending along the diagonal, from most similar to least similar.
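The two sorted orderings can be illustrated with a short Python sketch. It makes some assumptions: the distance matrix is a plain list of lists, the function names are hypothetical, and the leading eigenvector is approximated by shifted power iteration, which is a swapped-in technique for small examples and not necessarily what the software does internally.

```python
def order_by_similarity(dist, target, k):
    """Return `target` followed by its k most similar observations."""
    others = [j for j in range(len(dist)) if j != target]
    others.sort(key=lambda j: dist[target][j])  # smallest distance first
    return [target] + others[:k]

def order_by_first_pc(dist, iterations=500):
    """Ascending order along the leading eigenvector of `dist`."""
    n = len(dist)
    # Shifting by the largest row sum makes every eigenvalue non-negative
    # without changing the eigenvectors, so power iteration settles on the
    # eigenvector belonging to the largest eigenvalue.
    shift = max(sum(row) for row in dist)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(dist[r][c] * v[c] for c in range(n)) + shift * v[r]
             for r in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return sorted(range(n), key=lambda r: v[r])

# Made-up symmetric distance matrix for three observations.
dist = [[0.0, 0.1, 1.0],
        [0.1, 0.0, 0.9],
        [1.0, 0.9, 0.0]]

print(order_by_similarity(dist, 2, 2))
print(order_by_first_pc(dist))
```

Here observations 0 and 1 are close to each other and far from observation 2, so both orderings keep 0 and 1 adjacent.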
12.1.2 The Observation Distance Matrix
Figure 12.4 shows the distance matrix created from the 100 random trees built on a subset of 1009 compounds from the external.ghd file imported into a project. The dark blue squares near the diagonal represent compounds with high similarity. Note that in the bottom left there are about 100 compounds that are similar to one another but dissimilar to the other ~900 compounds, as shown by the dark red band running up and across to the right.
Following is a synopsis of the use and meaning of the pull-down menus and buttons in the Distance Matrix Plot window:
12.1.3 Set Axes
The top two pull-down menus allow you to set the axes to either Distance or Response. The Response axis plots the dependent variable, so the actual wording will reflect the column heading in the spreadsheet. For Multivariate trees, several Response headings will be available. The default settings for both axes are set to Distance so the distance matrix is first displayed symmetrically. It is also possible to select the Response as one axis, enabling the location of clusters that are not only similar, but also have a desired response range.
12.1.4 Stop Calculation/Stop Refresh and Resume Calculation/Resume Refresh
In Fig. 12.4, the top button (shown reading Refresh) reads Stop Calculation only during the initial calculation of the matrix. To stop the calculation, press the Stop Calculation button; the button then changes to Resume Calculation. To continue the interrupted computation of the distance matrix, click Resume Calculation. The resulting calculation is stored in an internal matrix for faster re-plotting.
After the initial calculation is finished, the button reads Refresh, as in Fig. 12.4. If you click the Refresh button, the matrix is recalculated. During the recalculation the button face changes to Stop Refresh. To stop the recalculation, press the Stop Refresh button; the button then changes to Resume Refresh. Click Resume Refresh to continue the interrupted computation of the distance matrix.
12.1.5 Copy to Clipboard
The Copy to Clipboard button copies any table of numbers appearing at lower left to the clipboard for pasting into other applications. This table is updated whenever the mouse is clicked on the plot or when the arrow keys are used to move from point to point on the plot. If no table is present, then the button is greyed-out as in Fig. 12.4.
12.1.6 Creating a Spreadsheet or Tree view from the Matrix Plot
The center drop down menu in the lower right corner selects which of two actions is taken when you left-click and drag the mouse diagonally over a region of the matrix plot: Left click and drag for spreadsheet or Left click and drag for tree. Depending on your choice, a spreadsheet or a tree window opens containing the observations for the defined region. See Section 12.2.1 for detailed examples.
12.1.7 Zoom Mode
Right-clicking and dragging diagonally across a region of the matrix plot opens a new window with a close-up plot of the defined region. A file tab appears at the bottom of the window labelling the zoomed-in view. To return to the original view, either click on the Distance Matrix Plot file tab or delete the zoomed-in window by clicking on the “X” in the lower-right corner.
12.1.8 Modify Color Scaling
The color guides at the right of and above the plot define the mapping between the colors and the values in the distance plot. It is possible to narrow the window of color by clicking and dragging over a region of the color guide.
If you Shift-click and drag over a region of one axis, both axes change symmetrically (i.e., use the same measure) and have an identical range.
To undo or reset the color scales, use either of the Distance/Response pull-down menus, which offer three choices: reset this scale, reset symmetrically, or reset all scales.
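As a rough illustration of what narrowing the color window means, the sketch below maps a distance onto a blue-to-red ramp over a chosen range. The function name, the linear ramp, and the clamping of out-of-range values to the end colors are all assumptions made for illustration; the software's actual color mapping is not documented here.

```python
def blue_red(value, low=0.0, high=1.0):
    """Map a distance to an (r, g, b) tuple on a blue-to-red ramp.

    Values outside [low, high] are clamped to the end colors, which is
    one plausible reading of a narrowed color window (an assumption).
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    t = (value - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp to the selected window
    return (round(255 * t), 0, round(255 * (1 - t)))

# Full scale: a distance of 0 is pure blue and a distance of 1 is pure red.
print(blue_red(0.0), blue_red(1.0))
# Narrowed to [0.0, 0.5]: a distance of 0.5 already reaches pure red.
print(blue_red(0.5, low=0.0, high=0.5))
```

Narrowing the window spends the whole color range on a smaller interval of distances, which makes small differences within that interval easier to see.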
12.1.9 Effect of Clicking on the Plot
Click on a point within the plot, or on a compound name along the bottom or side of the plot. A table of statistics for the compound pair is generated in the lower left corner of the window (as seen in Fig. 12.6).
In some plots, more than one compound fits within one pixel. When this occurs, the values are averaged over the pixel, and a star appears after the compound name in the lower left corner display. The arrow keys on the keyboard may be used to move through the values. Right-mouse-button zooming is recommended to get a better visualization in this circumstance.
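The averaging over a pixel can be pictured as block averaging of the distance matrix. The following is a minimal sketch assuming a square list-of-lists matrix and a hypothetical function name; how the software actually assigns observations to pixels is not stated in this section.

```python
def average_into_pixels(dist, pixels):
    """Downsample an n x n matrix to pixels x pixels by block averaging."""
    n = len(dist)
    if not 0 < pixels <= n:
        raise ValueError("pixels must be between 1 and the matrix size")
    sums = [[0.0] * pixels for _ in range(pixels)]
    counts = [[0] * pixels for _ in range(pixels)]
    for r in range(n):
        for c in range(n):
            # Observation index r lands in pixel r * pixels // n.
            pr, pc = r * pixels // n, c * pixels // n
            sums[pr][pc] += dist[r][c]
            counts[pr][pc] += 1
    return [[sums[a][b] / counts[a][b] for b in range(pixels)]
            for a in range(pixels)]

# Made-up 4 x 4 matrix drawn on a 2 x 2 pixel grid.
matrix = [[0.0, 1.0, 2.0, 3.0],
          [1.0, 0.0, 4.0, 5.0],
          [2.0, 4.0, 0.0, 6.0],
          [3.0, 5.0, 6.0, 0.0]]
print(average_into_pixels(matrix, 2))
```

Each output cell is the mean of a 2 x 2 block of the input, which is why zooming in is the better way to inspect individual pairs.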
12.1.10 Color Drop Down Menu
The bottom drop down menu allows you to exercise your artistic side and choose from the following color combinations to be applied to the Distance Matrix:
Multi-color displays the matrix with a full rainbow color scheme.
Blue-red is the default display and creates the matrix in a blue and red color continuum.
White-red is a colorful and yet conservatively monotonic color combination.
Black-white is the least likely presentation of the Distance Matrix to fool someone with color blindness.
OK, we said exercise your artistic side, not necessarily fulfill all avenues of expression, but we apologize for any artistic frustration this menu may have engendered.