Observation Distance Matrix from Random Trees

Figure 1. Observation distance matrix.

The Observation Distance matrix (Figure 1) is suitable for analysis by external clustering or multidimensional scaling algorithms. The distance metric is based on the idea that when two observations end up together in a small subset deep within a tree, they share the descriptors that drive their response, and hence are similar in response space. Within a given tree, the distance between observations i and j is the number of observations in the deepest node of the tree that contains both of them, divided by the number of observations at the root of the tree. The overall distance between observations i and j is the average of the distances calculated for this pair over all trees in a multiple-tree file.
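The definition above can be written compactly. The symbols are introduced here for illustration only and do not come from the software: T is the number of trees in the file, N_t is the number of observations at the root of tree t, and n_t(i,j) is the number of observations in the deepest node of tree t that contains both i and j.

```latex
d_t(i,j) = \frac{n_t(i,j)}{N_t},
\qquad
d(i,j) = \frac{1}{T} \sum_{t=1}^{T} d_t(i,j)
```

Under this definition each per-tree distance lies in the interval (0, 1], with 1 meaning that the pair is separated at the very first split.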

Consider an example where there are 1000 observations in a tree, and observations i and j are split apart into daughter nodes directly beneath the root node. In this case, the deepest node in which i and j appear together is the root node itself, so the distance between them is 1000/1000 = 1. Suppose instead that i and j end up in a node near the bottom of the tree that contains a total of 10 observations. Then the distance is d(i,j) = 10/1000 = 0.01. This metric contrasts with other similarity metrics in that the specific features that drive the response are the ones used to compute the distance between observations.
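The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the software's implementation. It assumes a hypothetical representation in which each tree is a list of root-to-leaf paths, one per observation, with each path given as a tuple of node ids. The function names are made up for this example, and setting the diagonal to 0 is also an assumption, since the text does not define the distance of an observation to itself.

```python
from collections import Counter
from itertools import combinations


def tree_distance(paths):
    """Pairwise distances for one tree.

    paths[i] is the root-to-leaf tuple of node ids followed by
    observation i (a hypothetical representation for this sketch).
    """
    n = len(paths)
    # Number of observations passing through each node.
    node_size = Counter(node for path in paths for node in path)
    root_size = node_size[paths[0][0]]
    # Assumption: the diagonal is left at 0.0.
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        deepest = None
        # Walk both paths from the root until they diverge.
        for a, b in zip(paths[i], paths[j]):
            if a != b:
                break
            deepest = a
        dist[i][j] = dist[j][i] = node_size[deepest] / root_size
    return dist


def forest_distance(forest):
    """Average the per-tree distances over all trees."""
    n = len(forest[0])
    total = [[0.0] * n for _ in range(n)]
    for paths in forest:
        d = tree_distance(paths)
        for i in range(n):
            for j in range(n):
                total[i][j] += d[i][j]
    return [[value / len(forest) for value in row] for row in total]


# Two small example trees over four observations.
tree_a = [(0, 1, 3), (0, 1, 4), (0, 2), (0, 2)]
tree_b = [(0, 1), (0, 2), (0, 2), (0, 2)]
matrix = forest_distance([tree_a, tree_b])
```

In `tree_a`, observations 0 and 1 last share node 1, which holds 2 of the 4 observations, so their distance in that tree is 0.5. In `tree_b` they are separated at the root, giving 1.0, so the averaged entry `matrix[0][1]` is 0.75.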

In the Distance Matrix view of Figure 1, the dark blue squares along the diagonal represent observations with high similarity.

Figure 2. Observation distance matrix zoom.

By right-clicking and dragging diagonally across the area marked by the black square in Figure 1, we zoom into an area of about 183 patients who are very similar to one another but dissimilar, to varying degrees, from the patients around them. The enlarged area, Figure 2, represents 108 patients from the larger matrix.

Squares 1 and 2 in the zoomed-in view of Figure 2 correspond to the nodes marked with red squares (squares 1 and 2) in the tree in Figure 3.

Figure 3. Recursive partitioning tree.

This tree was created using the manual split option. Although the genetic variables may not have appeared under the "greedy tree" approach, manual tree building gives users the option to split on variables of their own choosing.

By left-clicking and dragging over an area of the Observation Distance matrix, we can open a spreadsheet or a tree view of the observations that reside in a given square. The spreadsheet in Figure 4 shows a group of 23 patients from the tree node in red square 2 of Figure 3. We see that all of these observations are female smokers who are homozygous 1_1 for Gen_A.

HelixTree association analysis spreadsheets

Figure 4. Resulting spreadsheet.
