12.1 Observation Distance Matrix Overview

The Observation Distance Matrix holds the distance between each observation in the data set and every other observation. It is based upon the idea that when two observations end up together in a small subset deep within a tree, the descriptors that drive their response are shared and, hence, the observations are similar in model space.

Within a given tree, the distance between observations i and j is the total number of observations in the deepest node of the tree in which the two observations appear together, divided by the total number of observations at the root of the tree. The overall distance between observations i and j is the average of the distances calculated for this pair over all the trees in a multiple tree model.
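The definition above can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the software: T is the number of trees in the model, N_t is the number of observations at the root of tree t, and n_t(i,j) is the number of observations in the deepest node of tree t that contains both i and j.

```latex
\[
d_t(i,j) = \frac{n_t(i,j)}{N_t},
\qquad
d(i,j) = \frac{1}{T} \sum_{t=1}^{T} d_t(i,j)
\]
```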

Consider an example where there are 1000 observations in a tree, and observations i and j are split apart into daughter nodes directly beneath the root node. In this case, the deepest node in which i and j appear together is the root node itself, so the distance between them is 1000/1000=1. Suppose instead that i and j ended up in a node near the bottom of the tree that contained 10 observations. Then the distance would be: d(i,j)=10/1000=0.01.
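The two cases in this example can be reproduced with a minimal Python sketch. It is not the software's implementation: the representation of a tree as a root-to-leaf path of node ids plus a table of node sizes, and the names tree_distance and model_distance, are assumptions made for illustration.

```python
def tree_distance(path_i, path_j, node_size):
    """Distance between two observations within one tree.

    path_i, path_j: node ids visited from the root down to each
    observation's terminal node (hypothetical representation).
    node_size: maps a node id to the number of observations in that node.
    """
    root = path_i[0]
    deepest_shared = root
    # Walk both paths from the root until they diverge.
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        deepest_shared = a
    return node_size[deepest_shared] / node_size[root]


def model_distance(per_tree):
    """Average the per-tree distances of one pair over all trees.

    per_tree: list of (path_i, path_j, node_size) tuples, one per tree.
    """
    return sum(tree_distance(*t) for t in per_tree) / len(per_tree)


# Tree 1: i and j separate directly beneath the root (1000 observations).
sizes_1 = {"root": 1000, "left": 600, "right": 400}
d1 = tree_distance(["root", "left"], ["root", "right"], sizes_1)

# Tree 2: i and j stay together down to a node holding 10 observations.
sizes_2 = {"root": 1000, "left": 400, "deep": 10}
path = ["root", "left", "deep"]
d2 = tree_distance(path, path, sizes_2)

print(d1, d2)  # 1.0 0.01
overall = model_distance([
    (["root", "left"], ["root", "right"], sizes_1),
    (path, path, sizes_2),
])
```

With only these two trees in the model, the overall distance for the pair is the mean of 1 and 0.01.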

Note that this similarity metric contrasts with other similarity metrics, in that the specific features that drive the response are the ones used to compute the distance between the observations.

12.1.1 Creating an Observation Distance Matrix

From the Navigator Window view, click on a project or File->Open New Project and import the data to analyze. From the spreadsheet view, click Analysis->Create a Multiple Tree Model.

A Random Tree Creation window opens. Check or change any tree creation variables and click Go.

After the random trees are created, a Multitree Model window opens (Fig. 12.1) displaying a list of the random trees.


[Picture]
Figure 12.1: The Multitree Model window


[Picture]
Figure 12.2: The menus leading to generating an Observation Distance Matrix

Figure 12.2 displays the menus leading to generating an Observation Distance Matrix. From the Observations->Plot Obs. Dist. Matrix menus, the following options are available:

Unsorted The symmetric matrix of observations is displayed ordered in the same sequence as they appear in the spreadsheet from which they came. The first observation is at the lower left, and the last observation is at the upper right.

Sorted by 1st Principal Component The matrix is sorted in ascending order by the 1st principal component of the distance matrix. This is a simple clustering approach. The 1st principal component is the eigenvector corresponding to the largest eigenvalue of the distance matrix. Because calculating this first principal component takes O(n^3) operations, where n is the number of observations, you will not want to create this plot for more than a few thousand observations.

Sorted by Similarity to One Observation This option is most useful when we are interested in seeing which observations are most similar to a given observation. For instance, a scientist may wish to see the 100 patients most similar to a given patient of interest. A dialog pops up giving a list of all the observations, from which you can choose one. The distance matrix is then displayed with the chosen observation in the lower left and the k most similar observations ascending along the diagonal, from most similar to least similar.
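The two sorted orderings can be sketched in Python as follows. This is an illustration under stated assumptions, not the software's code. The function names and the sample matrix are hypothetical. In place of a full O(n^3) eigendecomposition, the sketch substitutes shifted power iteration, a different technique that approximates only the eigenvector of the largest eigenvalue.

```python
def leading_eigenvector(dist, iterations=500, tol=1e-12):
    """Approximate the eigenvector of the largest eigenvalue of a
    symmetric, non-negative matrix by shifted power iteration."""
    n = len(dist)
    # Adding the largest row sum to the diagonal makes every eigenvalue of
    # the shifted matrix non-negative, so the iteration settles on the
    # largest eigenvalue instead of a large negative one. The eigenvectors
    # themselves are unchanged by the shift.
    shift = max(sum(row) for row in dist)
    v = [1.0 / n ** 0.5] * n
    for _ in range(iterations):
        w = [sum(dist[r][c] * v[c] for c in range(n)) + shift * v[r]
             for r in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # all-zero matrix: nothing to iterate on
            return v
        w = [x / norm for x in w]
        change = max(abs(a - b) for a, b in zip(v, w))
        v = w
        if change < tol:
            break
    return v


def order_by_first_component(dist):
    """Observation indices in ascending order of the leading eigenvector."""
    v = leading_eigenvector(dist)
    return sorted(range(len(dist)), key=lambda k: v[k])


def most_similar(dist, target, k):
    """The k observations closest to `target`, most similar first."""
    others = [j for j in range(len(dist)) if j != target]
    others.sort(key=lambda j: dist[target][j])
    return others[:k]


# Hypothetical distances among three observations (diagonal set to 0 here).
D = [
    [0.0, 0.01, 1.0],
    [0.01, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]
pc_order = order_by_first_component(D)
print(most_similar(D, 0, 2))  # [1, 2]
```

Each power iteration step costs O(n^2) operations, so the substitution trades the exact decomposition for an approximation that is cheaper per step. Whether that trade is acceptable depends on how quickly the iteration converges for a given matrix.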


[Picture]
Figure 12.3: On a large number of trees, a progress indicator monitors the process.

12.1.2 The Observation Distance Matrix


[Picture]
Figure 12.4: The Observation Distance Matrix Plot in all its glory

Figure 12.4 shows the distance matrix created from the 1000 trees built from CSIM.ghd.

Following is a synopsis of the use and meaning of the pull-down menus and buttons in the Distance Matrix Plot window:

12.1.3 Set Axes

The top two pull-down menus allow you to set the axes to either Distance or Response. The Response axis plots the dependent variable, so the actual wording will reflect the column heading in the spreadsheet. For Multivariate trees, several Response headings will be available. The default settings for both axes are set to Distance so the distance matrix is first displayed symmetrically. It is also possible to select the Response as one axis, enabling the location of clusters that are not only similar, but also have a desired response range.

12.1.4 Stop Calculation/Stop Refresh and Resume Calculation/Resume Refresh

In Fig. 12.4, the top button (shown there labeled Refresh) reads Stop Calculation only during the initial calculation of the matrix. To stop the calculation, press the Stop Calculation button; the button then changes to Resume Calculation. To continue the interrupted computation of the distance matrix, click Resume Calculation. The resulting calculation is stored in an internal matrix for faster re-plotting.

After the initial calculation is finished, the button shows Refresh, as in Fig. 12.4. If you click the Refresh button, the matrix is recalculated. During the recalculation the button face changes to Stop Refresh. To stop the recalculation, press the Stop Refresh button; the button then changes to Resume Refresh. Click Resume Refresh to continue the interrupted computation of the distance matrix.

12.1.5 Copy to Clipboard

The Copy to Clipboard button copies any table of numbers appearing at lower left to the clipboard for pasting into other applications. This table is updated whenever the mouse is clicked on the plot or when the arrow keys are used to move from point to point on the plot. If no table is present, then the button is greyed-out as in Fig. 12.4.

12.1.6 Creating a Spreadsheet or Tree view from the Matrix Plot


[Picture]
Figure 12.5: Left-click and drag drop down menu

The center drop down menu in the lower right corner selects which of two actions is taken when you left-click and drag the mouse diagonally over a region of the matrix plot: Left click and drag for spreadsheet or Left click and drag for tree. Depending on your choice, a spreadsheet window or a tree window opens containing the observations for the defined region. See Section 12.2.1 for detailed examples.

12.1.7 Zoom Mode

Right-clicking and dragging diagonally across a region of the matrix plot opens a new window with a close-up plot of the defined region. A file tab appears at the bottom of the window labelling the zoomed-in view. To return to the original view, either click on the Distance Matrix Plot file tab or delete the zoomed-in window by clicking on the “X” in the lower-right corner.

12.1.8 Modify Color Scaling

The color guide at the right of and above the plot defines the mapping between the colors and the distance plot. It is possible to narrow the window of color by clicking and dragging over a region of the color guide.

If you Shift-click and drag over a region of one axis, both axes change symmetrically (i.e., use the same measure) and have an identical range.

To undo or reset the color scales, both pull down menus for Distance/Response allow you to either: reset this scale, reset symmetrically, or reset all scales.
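The effect of narrowing the color window can be illustrated with a short Python sketch. It assumes a simple linear blue-to-red ramp with low values shown in blue. The function name, the RGB encoding, and the direction of the ramp are all assumptions for illustration, and the software's actual color mapping may differ.

```python
def distance_to_rgb(value, lo=0.0, hi=1.0):
    """Map a value to a color on a linear blue-to-red ramp.

    Values at or below `lo` are fully blue, values at or above `hi` are
    fully red (assumed direction), and values in between are blended.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Position of the value inside the window, clamped to [0, 1].
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    red = round(255 * t)
    blue = round(255 * (1.0 - t))
    return (red, 0, blue)


# Full scale: the whole range 0 to 1 shares the ramp.
full_low = distance_to_rgb(0.0)            # (0, 0, 255)
full_high = distance_to_rgb(1.0)           # (255, 0, 0)

# Narrowed window 0 to 0.1: small distances use the entire ramp, and
# everything above 0.1 saturates at the top color.
narrow_high = distance_to_rgb(0.5, 0.0, 0.1)  # (255, 0, 0)
```

Narrowing the window spends the whole ramp on a small range of values, which makes differences among very similar observations easier to see at the cost of saturating everything outside the window.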

12.1.9 Effect of Clicking on the Plot

Click on a point within the plot, or on an observation along the bottom or side of the plot. A table of statistics for the observation pair is generated in the lower left corner of the window (as seen in Fig. 12.6).

In some plots, more than one observation fits within one pixel. In this case, a star will appear in the lower left corner display after the observation name. The arrow keys on the keyboard may be used to maneuver through the values. (When this occurs, the values are averaged over the pixel. Right-mouse-button zooming is recommended to get a better visualization under this circumstance.)
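The averaging over a pixel can be pictured with the following Python sketch, which reduces an n x n matrix to a smaller grid by averaging every value that falls into the same pixel. The function name and the rule that assigns observations to pixels are assumptions for illustration. The software's own binning is not documented here.

```python
def average_into_pixels(dist, pixels):
    """Downsample an n x n matrix to pixels x pixels by block averaging."""
    n = len(dist)
    if not 0 < pixels <= n:
        raise ValueError("pixels must be between 1 and the matrix size")
    # Assign each observation index to the pixel that contains it.
    bins = [i * pixels // n for i in range(n)]
    total = [[0.0] * pixels for _ in range(pixels)]
    count = [[0] * pixels for _ in range(pixels)]
    for i in range(n):
        for j in range(n):
            total[bins[i]][bins[j]] += dist[i][j]
            count[bins[i]][bins[j]] += 1
    return [[total[r][c] / count[r][c] for c in range(pixels)]
            for r in range(pixels)]


# Hypothetical 4 x 4 matrix drawn on a 2 x 2 pixel grid: each pixel
# shows the mean of a 2 x 2 block of observation pairs.
D4 = [
    [0.0, 0.5, 1.0, 1.0],
    [0.5, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.25],
    [1.0, 1.0, 0.25, 0.0],
]
shown = average_into_pixels(D4, 2)
print(shown)  # [[0.25, 1.0], [1.0, 0.125]]
```

Averaging hides the individual values inside a pixel, which is why zooming in is recommended when a star appears.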

12.1.10 Color Drop Down Menu

The bottom drop down menu allows you to exercise your artistic side and choose from the following color combinations to be applied to the Distance Matrix:
Multi-color displays the matrix with a full rainbow color scheme.
Blue-red is the default display and creates the matrix in a blue and red color continuum.
White-red is a colorful and yet conservatively monotonic color combination.
Black-white is the least likely presentation of the Distance Matrix to fool someone with color blindness.
OK, we said exercise your artistic side, not necessarily fulfill all avenues of expression, but we apologize for any artistic frustration this menu may have engendered.