Working with Nodes


[Picture]
Figure 7.6: The node’s pop-up menu showing the various actions that may be performed on a node.

7.3.1 Node Pop-up Menu Selections

The following sections detail the pop-up menu selections:

7.3.1.1 Split Node


[Picture]
Figure 7.7: A single split performed using the Split Node menu command

The Split Node command finds and performs the most significant split over all independent variables in that node. If there are equally significant splits, the first one found sweeping left to right through the columns is performed. If no split is found whose p-value (of the selected type) is less than the threshold set in the Tree->Options menu, then no split is performed and a dialog box appears with the message No split found.
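The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not HelixTree's implementation: the function name `pick_split` and the idea of passing in one precomputed p-value per column are assumptions made for the example.

```python
from typing import Optional, Sequence


def pick_split(p_values: Sequence[Optional[float]], threshold: float) -> Optional[int]:
    """Return the index of the column to split on, or None for "No split found".

    p_values[i] is the (assumed precomputed) p-value of the best split on
    column i, or None when that column offers no candidate split.
    """
    best_index = None
    best_p = None
    for index, p in enumerate(p_values):  # sweep left to right through the columns
        if p is None:
            continue
        # Strict "<" keeps the first column found when p-values are tied.
        if best_p is None or p < best_p:
            best_index, best_p = index, p
    # A split is performed only when its p-value is less than the threshold.
    if best_p is None or not best_p < threshold:
        return None
    return best_index
```

For example, `pick_split([0.2, 0.01, 0.01, 0.5], 0.05)` selects column 1, the first of the two tied columns, while `pick_split([0.2, 0.3], 0.05)` returns `None`, which corresponds to the No split found message.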

7.3.1.2 Manual Split


[Picture]
Figure 7.8: The Manual Split dialog box listing available split options

Selecting Manual Split opens a dialog that allows you to pick which independent variable to split on. This very powerful HelixTree feature is examined in detail in Section 7.4, Manually Splitting Nodes. Clicking on the top split option in Fig. 7.8 opens the following tree:


[Picture]
Figure 7.9: A tree after one Manual Split.

7.3.1.3 Collapse/Expand

A large tree view can get a bit cluttered. One way to temporarily reduce the visual information is to collapse some nodes. Select the node to collapse (it must have child nodes), and choose Collapse/Expand from the menu.


[Picture]
Figure 7.10: The parent node’s menu showing the Collapse command

Once the child nodes are collapsed, the parent node’s text turns blue to indicate that there are collapsed nodes beneath it. To expand them, repeat the Collapse/Expand menu choice.


[Picture]
Figure 7.11: The parent node with blue text indicating collapsed child nodes

7.3.1.4 Recursive Split

HelixTree can calculate the most significant split for a node, or we can manually decide which independent variable to split on.

Clicking the Recursive Split menu command on a selected node causes HelixTree to keep splitting nodes recursively. A new split is made only if its p-value (of the selected type) is less than or equal to the significance threshold set in the Tree->Options menu; otherwise splitting stops at that node. At each node, the splitter with the smallest p-value (of the selected type) is selected to split the node.

This creates a greedy tree, built by choosing the most significant split found at each successive stage of the splitting. The terminating nodes from this process are leaf nodes, meaning that no remaining split on them has a p-value within the threshold set in the tree options.
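The greedy recursion can be sketched as follows. This is a minimal sketch under stated assumptions, not HelixTree's code: the dictionary node layout, the `find_best_split` callback, and the `toy_finder` used to exercise it are all hypothetical. In particular, `toy_finder` returns made-up p-values purely so the recursion has something to act on.

```python
from typing import Callable, List, Optional, Tuple

Records = List[float]
# A split finder returns (p_value, daughter_groups) for the most significant
# split of the records, or None when no candidate split exists.
SplitFinder = Callable[[Records], Optional[Tuple[float, List[Records]]]]


def recursive_split(records: Records, find_best_split: SplitFinder,
                    threshold: float) -> dict:
    """Greedily grow a tree; a node without a significant split stays a leaf."""
    node = {"records": records, "children": []}
    best = find_best_split(records)
    if best is None:
        return node
    p_value, daughters = best
    if p_value > threshold:  # stopping rule: best split is not significant
        return node
    node["p_value"] = p_value
    node["children"] = [recursive_split(d, find_best_split, threshold)
                        for d in daughters]
    return node


def count_leaves(node: dict) -> int:
    """Count the terminating (leaf) nodes of the tree."""
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


def toy_finder(records: Records) -> Optional[Tuple[float, List[Records]]]:
    """Hypothetical stand-in: halve the sorted records and invent a p-value."""
    if len(records) < 2:
        return None
    ordered = sorted(records)
    middle = len(ordered) // 2
    fake_p = 0.01 if len(ordered) >= 4 else 0.5  # placeholder, not a real test
    return fake_p, [ordered[:middle], ordered[middle:]]


tree = recursive_split([1, 2, 3, 4, 5, 6, 7, 8], toy_finder, threshold=0.05)
```

With the toy finder, the eight records are split twice and the recursion stops once the groups of two records report a p-value above the threshold, leaving four leaf nodes.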

7.3.1.5 Spreadsheet


[Picture]
Figure 7.12: A spreadsheet for the data contained in one node

You can take all the records represented by a node and turn them into a new spreadsheet. As with any other spreadsheet, you can save and manipulate this data.

7.3.1.6 Resample


[Picture]
Figure 7.13: The node Resample menu options for non-leaf nodes

This option will only be available on non-leaf nodes where the split is not a regression “split”. The user then has the choice of computing a resampled estimate of the p-value for the Current Node, the Subtree, or the entire tree using All Nodes.

The resampling approach is to first calculate the statistic for the optimal split, S0. The algorithm then creates thousands of splits of the same cardinality and the same daughter sample sizes, in which the observations are shuffled at random. A statistic Si (i = 1..N) is calculated for each of these randomly shuffled splits, and the p-value is calculated as the number of Si’s greater than or equal to S0, divided by N, the number of iterations.
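The shuffling procedure can be sketched in Python as shown here. The split statistic is not specified in this section, so the `mean_difference` statistic below is an assumption chosen only for illustration, as are the function names, the default iteration count, and the fixed seed.

```python
import random
from typing import Callable, List, Sequence


def mean_difference(groups: Sequence[Sequence[float]]) -> float:
    """Illustrative statistic (an assumption): spread of the daughter means."""
    means = [sum(group) / len(group) for group in groups]
    return max(means) - min(means)


def resampled_p_value(daughters: Sequence[Sequence[float]],
                      statistic: Callable[[Sequence[Sequence[float]]], float] = mean_difference,
                      iterations: int = 1000,
                      seed: int = 0) -> float:
    """Estimate a p-value by shuffling responses across the daughter nodes."""
    rng = random.Random(seed)
    s0 = statistic(daughters)                      # statistic of the actual split
    sizes = [len(group) for group in daughters]    # keep daughter sample sizes
    pooled: List[float] = [value for group in daughters for value in group]
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:                         # same cardinality, same sizes
            shuffled.append(pooled[start:start + size])
            start += size
        if statistic(shuffled) >= s0:
            hits += 1
    return hits / iterations
```

Calling `resampled_p_value([[1, 2, 3, 4, 5], [101, 102, 103, 104, 105]])` yields a small value, because very few random shuffles separate the responses as cleanly as the actual split does.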

Note that this approach has some deficiencies. The ideal way to resample would be to randomly permute the observations and apply the same split-finding methodology to find the optimal split on the random data as was used to find the optimal split, then rank the p-values of those splits against the optimal one. Nevertheless, the approach we provide is still useful, particularly when the split does not involve many values for the predictor. The approach we provide is also correct for calculating resampling p-values for holdout samples, that is, when you have built a model on a training set and then wish to look at the p-values obtained by dropping the holdout samples down the tree. In this case the resampled p-value without Bonferroni adjustment is a good significance estimate.

The number of iterations is a parameter that can be set in the tree options menu. Larger numbers give more accurate p-values. A resampled p-value can only be as small as 1 divided by the number of resampling iterations. Hence resampling is most useful deeper in the tree where the number of observations is smaller and the p-values are borderline significant. If the resampled p-values are not visible on your tree after computing them, you must go to the tree options menu and turn on rsP and rsbP in the tree node display.

7.3.1.7 Visualize Genetics


[Picture]
Figure 7.14: The node Visualize Genetics menu options

This menu item is only useful if genetics variables are included in the data set. There are five possible options, not all of which are always available:

  1. Hardy Weinberg Equilibrium Plot (15.1.1)
  2. Linkage Disequilibrium Plot (14)
  3. Two-Loci Genetic P-Value Plot (20)
  4. Haplotype Trend Regression (17)
  5. Display Allele Table (17.2)

7.3.2 Visualize Split Data

This menu provides a number of ways of visualizing your data.


[Picture]
Figure 7.15: The node Visualize Split Data menu options

7.3.3 Visualize Split Data->Multiple Tree Clustering

There are three variants of the multiple tree clustering plot, also called the Observation Distance Matrix. The matrix is a dissimilarity plot of each observation in the node against every other observation in the node. The dissimilarity is calculated based on how similar the observations are in model space using a multiple tree model. This matrix may be ordered in three different ways: unsorted (spreadsheet order), ordered by the first principal component of the dissimilarity matrix, or sorted by similarity to one of the observations in the node.


[Picture]
Figure 7.16: The Multiple Tree Clustering submenu options

An example observation distance matrix is shown in Fig. 7.17. As we will see later in Section 9, Random Tree Generation, we can create cluster plots and store them in a file.

This menu choice prompts the user for a multiple tree model, and then drops the set of observations down the trees in the tree file and calculates and visualizes a similarity matrix for the observations. This multiple tree clustering methodology is also available from the random tree browsing view in Section 9.3.
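One way to picture this calculation is the sketch below. The exact dissimilarity measure is not defined in this section, so the code assumes a common convention: two observations are similar in proportion to the number of trees in which they land in the same terminal node. That convention, the `leaf_ids` input layout, and both function names are assumptions for illustration only.

```python
from typing import List, Sequence


def dissimilarity_matrix(leaf_ids: Sequence[Sequence[int]]) -> List[List[float]]:
    """Build an observation-by-observation dissimilarity matrix.

    leaf_ids[t][i] is the terminal node reached by observation i in tree t.
    ASSUMPTION: dissimilarity = 1 - (fraction of trees in which the two
    observations reach the same terminal node).
    """
    n_trees = len(leaf_ids)
    n_obs = len(leaf_ids[0])
    matrix = [[0.0] * n_obs for _ in range(n_obs)]
    for i in range(n_obs):
        for j in range(n_obs):
            shared = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            matrix[i][j] = 1.0 - shared / n_trees
    return matrix


def order_by_similarity_to(matrix: Sequence[Sequence[float]], reference: int) -> List[int]:
    """Order observation indices from most to least similar to one observation."""
    return sorted(range(len(matrix)), key=lambda i: matrix[reference][i])


# Two hypothetical trees, three observations.
example = dissimilarity_matrix([[0, 0, 1], [0, 1, 1]])
```

Here observations 0 and 1 share a terminal node in one of the two trees, so their dissimilarity is 0.5, while observations 0 and 2 never share one and get 1.0. The second function corresponds to the "sorted by similarity to one of the observations" ordering; the principal component ordering is not sketched.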


[Picture]
Figure 7.17: The Distance Matrix from the Multiple Tree Clustering option

This is the result of loading the multiple tree model. We will study this in Chapter 12, Observation Distance Matrix.

7.3.4 Visualize Split Data->Histogram

We can use this menu choice to view a histogram of this node and any child nodes.


[Picture]
Figure 7.18: A plot showing histograms for a parent and several child nodes

The top histogram represents the parent node. Subsequent histograms represent the children of the parent from left to right. When you click on a daughter histogram, it is highlighted in magenta, and the contribution to the parent is also highlighted in the parent histogram as shown above. We will cover histograms in detail in Chapter 11, Histogram Node Analysis.

7.3.5 Showing Split Data


[Picture]
Figure 7.19: The Show Split Data submenu choice highlighted

The Show Split Data menu choice opens a plot of the given split variable versus the response variable. A similar plot is shown in the Manual Split window, but the plot shown here does not allow interactive modification of the split.

This graph reflects the type of split done on a node. Some interesting types of splits that are not viewable from the Define Split functionality of the Manual Split window (but are viewable here) include the linear and logistic regression splits. To read more about the plot window and its options see Section 7.5, Defining Splits.

7.3.5.1 Linear Regression Splits


[Picture]
Figure 7.20: The Split View window with a linear regression graphed

The Show Split Data graph plots the data points, a moving average of the data point response values over a selectable window size, and the line whose equation was determined by performing the linear regression. Note that linear regression can only be performed with continuous dependent variables.
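The two overlays on this graph can be reproduced with a short sketch. It assumes ordinary least squares for the regression line and a simple sliding window over responses sorted by predictor value for the moving average. How HelixTree positions its window is not stated here, so that detail and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # requires at least two distinct predictor values
    return slope, mean_y - slope * mean_x


def moving_average(responses: Sequence[float], window: int) -> List[float]:
    """Average each run of `window` consecutive responses.

    The responses are assumed to be sorted by predictor value already.
    """
    return [sum(responses[i:i + window]) / window
            for i in range(len(responses) - window + 1)]


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
smoothed = moving_average([1, 3, 5, 7], window=2)
```

For these four points the fitted line has slope 2 and intercept 1, and a window of two produces the smoothed values 2, 4, and 6.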

NOTE: This graph may be zoomed using the right mouse button. See Section 7.5.5, Define Split, Zooming into a Specific Region. To restore the original graph, press the Reset View button.

7.3.5.2 Logistic Regression Splits


[Picture]
Figure 7.21: The Split View window with a logistic regression graphed

This Show Split Data graph plots the logistic curve over the data points for the regression, in addition to the data points and a line representation of the points’ average over a selectable number of values. With binary responses, a logistic curve often represents the function that best fits the data. Logistic regressions can only take place with binary dependent variables or categorical dependent variables that contain only two categories.
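As an illustration of the curve being drawn, the sketch below fits a logistic curve to a binary response by plain gradient ascent on the log-likelihood. The fitting method, learning rate, step count, and function names are assumptions made for this example; this section does not say how HelixTree estimates the curve.

```python
import math
from typing import Sequence, Tuple


def logistic(z: float) -> float:
    """The logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x: Sequence[float], y: Sequence[int],
                 learning_rate: float = 0.1, steps: int = 5000) -> Tuple[float, float]:
    """Fit P(y = 1 | x) = logistic(b0 + b1 * x) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for xi, yi in zip(x, y):
            error = yi - logistic(b0 + b1 * xi)  # observed minus predicted
            grad0 += error
            grad1 += error * xi
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Hypothetical binary responses that tend toward 1 as the predictor grows.
b0, b1 = fit_logistic([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
```

The fitted slope `b1` is positive for this data, so the curve rises from a probability below 0.5 at the low end of the predictor to one above 0.5 at the high end.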

NOTE: This graph may be zoomed using the right mouse button. See Section 7.5.5, Define Split, Zooming into a Specific Region. To restore the original graph, press the Reset View button.

7.3.5.3 Splits on Continuous Predictors


[Picture]
Figure 7.22: The Split View window with a split on continuous predictors graphed

This Show Split Data graph plots the data points and a line representation of the points’ average over a specified number of values. The purple line represents the split point. If you would like to modify a split point, you must do so from the Manual Split window described in Section 7.5.

NOTE: This graph may be zoomed using the right mouse button. See Section 7.5.5, Define Split, Zooming into a Specific Region. To restore the original graph, press the Reset View button.

7.3.5.4 Splits on Categorical Predictors


[Picture]
Figure 7.23: The Split View window with a split on categorical predictors graphed

The Show Split Data graph plots bar representations of the various categorical values, separated by a space that signifies the location of the split. Binary data is also represented in this way, with each response categorized on opposite sides of the split. The variation in height of the predictors can provide a visual guideline as to the significance of the split. If you would like to modify the categorization of the various categorical predictors, you must do so from the Manual Split window described in Section 7.5.

Also displayed in this graph is the odds ratio, if either binary data is being split or categorical data is being partitioned into exactly two aggregate categories. In this case, the upper left and upper right of the resulting window will each display an odds ratio.

The upper left display shows the ratio of the odds of a case falling into the category or set of categories shown on the left of the main part of the display to the odds of a control falling into that category or those categories.

The upper right display shows the ratio of the odds of a case falling into the right-displayed category or categories to the odds of a control falling into the right-displayed category or categories.

The upper left display is also equal to the ratio of the odds that an observation falling into the left-displayed category or categories is a case to the odds that an observation falling into the right-displayed category or categories is a case.

The upper right display is also equal to the ratio of the odds that an observation falling into the right-displayed category or categories is a case to the odds that an observation falling into the left-displayed category or categories is a case.
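The equivalence of these definitions can be checked numerically with a small sketch. The function names and the counts are hypothetical, and the code assumes all four cell counts are greater than zero so that no division by zero occurs.

```python
import math
from typing import Tuple


def odds_ratios(cases_left: int, cases_right: int,
                controls_left: int, controls_right: int) -> Tuple[float, float]:
    """Return the (upper left, upper right) odds ratios for a two-way split."""
    # Odds of a case falling on the left versus odds of a control doing so.
    upper_left = (cases_left / cases_right) / (controls_left / controls_right)
    # Odds of a case falling on the right versus odds of a control doing so.
    upper_right = (cases_right / cases_left) / (controls_right / controls_left)
    return upper_left, upper_right


def odds_ratio_by_side(cases_left: int, cases_right: int,
                       controls_left: int, controls_right: int) -> float:
    """Alternative form: odds of being a case on the left versus on the right."""
    return (cases_left / controls_left) / (cases_right / controls_right)


# Hypothetical counts: 30 cases and 20 controls on the left,
# 10 cases and 40 controls on the right.
left, right = odds_ratios(30, 10, 20, 40)
```

With these counts the upper left value is 6 and the upper right value is 1/6. The two displays are always reciprocals of each other, and `odds_ratio_by_side` returns the same value as the upper left display.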