7.2 Setting Options for Tree Analysis


[Picture]
Figure 7.2: The Tree->Options menu Tree tab option

The menu choice Tree->Options opens a two-tabbed dialog that lets you control how calculations are performed and which values are displayed in the individual nodes.

7.2.1 The Tree Tab

The Tree tab lets you set parameters that affect how values are calculated. These parameters control how nodes are split, how calculations are affected by missing values, and so on.

7.2.1.1 Minimum Elements per Child:

This feature counteracts the recursive partitioning tendency to pull off small, possibly outlier groups. When this is set greater than one, splits are computed so all child nodes will have at least that minimum number of observations.
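The effect of this setting can be illustrated with a minimal sketch of a two-way split search on one continuous predictor. This is not the program's own code: the function names, the use of within-node sum of squares as the split criterion, and the midpoint cut value are all assumptions made for illustration.

```python
def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(x, y, min_per_child=1):
    """Return (cut_value, total_sse) for the best two-way split of y on x,
    or None if no cut leaves at least min_per_child rows on each side.
    Hypothetical helper, for illustration only."""
    min_per_child = max(1, min_per_child)
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = None
    # i is the size of the left child; the right child has n - i rows.
    for i in range(min_per_child, n - min_per_child + 1):
        # Tied predictor values cannot be separated by a cut.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = _sse(left) + _sse(right)
        if best is None or total < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, total)
    return best


x = [1, 2, 3, 4, 5, 6]
y = [0, 10, 10, 10, 10, 10]   # one outlying observation at x = 1
split_min1 = best_binary_split(x, y, min_per_child=1)
split_min2 = best_binary_split(x, y, min_per_child=2)
```

With a minimum of 1, the search peels off the single outlying observation; with a minimum of 2, that cut is no longer eligible and the search has to choose a more balanced split.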

7.2.1.2 Segmenting Algorithm:

This can be either an exact O(n^2) algorithm or an approximate O(n^1.5) algorithm. For multi-way splits, the segmenting computation can be time consuming for large sample sizes. The approximate algorithm may occasionally give a slightly sub-optimal split, but because this happens rarely in practice, it is the default.

7.2.1.3 Max Segments:

This is the maximum cardinality of a multi-way split, with a default of 10. There is no limit to the cardinality you may specify for a split. However, most data sets don’t support 50-way splits, so it is often advisable to save computing time by not searching through high-cardinality splits. The exact and approximate algorithms run in O(kn^2) and O(kn^1.5) time respectively, where n is the sample size and k is the maximum split cardinality.
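To get a feel for these growth rates, the following sketch compares the two cost expressions. It only evaluates the formulas stated above with the constant factors ignored, so the numbers are relative units rather than run times; the function name is hypothetical.

```python
def split_search_cost(n, k, exact=True):
    """Relative cost k * n^2 (exact) or k * n^1.5 (approximate).
    Constant factors are ignored, so only ratios are meaningful."""
    exponent = 2.0 if exact else 1.5
    return k * n ** exponent


n, k = 10_000, 10
exact_cost = split_search_cost(n, k, exact=True)
approx_cost = split_search_cost(n, k, exact=False)
speedup = exact_cost / approx_cost          # equals sqrt(n), about 100 here

# Raising Max Segments from 10 to 50 multiplies either cost by 5.
cost_ratio_k50 = split_search_cost(n, 50) / split_search_cost(n, 10)
```

The advantage of the approximate algorithm grows with the square root of the sample size, while the cost of either algorithm grows linearly with Max Segments.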

7.2.1.4 Parallel Threads:

This setting allows the specification of the number of concurrently running threads for calculating splits on multiprocessor machines. If you have a multiple processor machine, this can markedly improve performance on very large data sets.

7.2.1.5 Resampling Iterations

This defines the number of iterations used to estimate p-values using the resampling approach. The smallest bound you can place on a p-value is 1 divided by the number of resampling iterations.
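The lower bound can be seen in a minimal permutation-test sketch. The resampling scheme used by the program is not described here, so the test statistic (absolute difference in group means), the function name, and the choice to report 1/iterations when no resample reaches the observed statistic are all assumptions.

```python
import random


def resampled_p_value(group_a, group_b, iterations=1000, seed=0):
    """Estimate a p-value for the difference in means of two groups by
    shuffling group labels. Hypothetical helper, for illustration only."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # With zero hits we can only say p < 1/iterations, so report that bound.
    return max(hits, 1) / iterations


p_separated = resampled_p_value(range(10), range(100, 110), iterations=200)
p_identical = resampled_p_value([1, 1, 1], [1, 1, 1], iterations=200)
```

However well separated the groups are, the reported value cannot fall below 1/200 with 200 iterations; reporting a smaller p-value requires more iterations.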

7.2.1.6 P Value Threshold:

The radio button for selecting RawP, AdjustedP or Bonferroni adjustedP determines which type of p-value is used when disallowing any split whose p-value is greater than the threshold p-value, or when the manual split window is determining when to call a split not significant.

In multiway splitting, the Multiway split pairwise P value threshold is also used as the significance level for pairwise comparisons between adjacent nodes. If any pair of adjacent nodes in a multiway split is not significantly different according to this threshold, the cardinality of the split is reduced.

By default, the Bonferroni adjusted p-value is used as the split stopping criterion, as it has the lowest chance of over-fitting. However, when doing variable selection, it might be advisable to use the raw or adjusted p-value as the stopping criterion, because the Bonferroni adjusted p-value may be overly conservative for this purpose when there are large numbers of predictors.

NOTE: for best prediction performance on multiple trees with the Bonferroni p-value threshold, larger p-value thresholds are better, even as high as 0.99. However, if you want to make the threshold lax, set Max Segments to 2 or 3. Using RawP or AdjustedP as the splitting criterion (these are now available) allows trees to be over-built with a less lax p-value threshold than the Bonferroni threshold would require.

7.2.1.7 Use Missing Values as Predictors

By default this is turned ON and uses missing values to predict the response. In regular splits, this corresponds to using a missing class for double and integer predictors. When this option is turned OFF, missing values are dropped for double and integer predictors. Hence a split may have daughter nodes whose observation counts do not add up to the parent’s total.

It is not advisable to drop missing values for the multiple tree variable correlation plot, as it will skew the proportions in the trees. For linear regression, this setting has no effect as missing values are never used in the regression, and the value of the parent mean is used as the prediction for those observations that are missing.

7.2.1.8 Linear/Logistic Regression:

This tells whether to use linear regression fits for continuous and ordinal predictors on continuous and ordinal responses, or whether to use logistic regression fits for continuous and ordinal predictors on a binary response (or a categorical response with just two categories).

With linear regression, a line fits the response in terms of a single predictor, and a p-value is computed for goodness of fit. Instead of splitting the data, a single node is dropped beneath the parent, containing the residual of the fit. If there are missing values in the predictor, the response will be fit as the mean of the parent node for those observations.
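The handling of missing predictor values can be sketched as follows. This is a minimal illustration using ordinary least squares on the non-missing rows; the function name and the use of None to mark a missing value are assumptions, and the p-value computation is omitted.

```python
def regression_residuals(x, y):
    """Fit y = intercept + slope * x on rows where x is present, and return
    (slope, intercept, residuals). Rows with a missing x (None) are fitted
    with the parent node mean of y. Assumes at least two distinct x values.
    Hypothetical helper, for illustration only."""
    parent_mean = sum(y) / len(y)
    observed = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    mean_x = sum(xi for xi, _ in observed) / len(observed)
    mean_y = sum(yi for _, yi in observed) / len(observed)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = []
    for xi, yi in zip(x, y):
        fitted = parent_mean if xi is None else intercept + slope * xi
        residuals.append(yi - fitted)
    return slope, intercept, residuals


x = [1, 2, 3, None]
y = [2, 4, 6, 8]
slope, intercept, residuals = regression_residuals(x, y)
```

Here the three complete rows lie exactly on the line y = 2x, so their residuals are zero, while the row with the missing predictor is fitted with the parent mean of 5 and has a residual of 3. The residuals are what the single child node beneath the parent would contain.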

With logistic regression, a logistic (sigmoid) curve fits the binary response in terms of a single continuous or ordinal predictor. A p-value is computed for goodness of fit. Instead of splitting the data, a single node is dropped beneath the parent, containing the residual of the fit. Beware that the residual is continuous, so subsequent splits are on a (usually bimodal) continuous distribution, which is not always amenable to hypothesis testing based on normality assumptions.

7.2.1.9 RP Splits:

RP splits are ON by default. To disallow RP splits, perhaps to explore only linear regression relationships, uncheck this box.

7.2.2 The Node View Tab


[Picture]
Figure 7.3: The Tree->Options menu Node View tab option

This tab allows you to turn ON or OFF the statistics and p-values displayed in the tree nodes. To turn ON a value’s display, select its check box.

7.2.2.1 What the Node Values Mean


[Picture]
Figure 7.4: A picture of a node with many of its values turned ON

The following table explains the meaning of the node symbols:


Symbol Meaning
BP_I Response Variable Name (only shows in root node)
n Node sample size
u Node mean. (For multivariate display, the means of each of the dependent variables will be listed as u1, u2, and so forth.)
s Node standard deviation. (For multivariate display, the standard deviations of each of the dependent variables will be listed as s1, s2, and so forth.)
se Node standard error. (The multivariate display will list standard error for each dependent variable.)
mse Node mean squared error. (The multivariate display will list mse for each dependent variable.)
P The p-value calculated as a T-test or F-test or Chi-squared test, as appropriate. Regression p-values are computed using a log-likelihood ratio test. This field is only used in “parent” or non-terminal nodes.
aP The p-value multiplicity adjusted for the number of possible cut-points searched through in the case of continuous, ordinal, or categorical predictors. For binary predictors and regression fits, aP=P as there is no multiple testing. This p-value is not adjusted for the multiplicity of independent variables searched through to find the split.
bP The Bonferroni-adjusted p-value of the split. This is the adjusted p-value (aP) multiplied by a Bonferroni correction, which is equal to the number of independent variables that were searched through and that actually could be split. Independent variables used for regression will be counted additionally to those used for segmenting. If an independent variable is used successfully for both regression and segmenting, it will be counted twice. Note that if a given independent variable cannot serve as a splitter because all instances are identical, then that variable will not contribute to the Bonferroni correction. For this reason, the correction factor will either stay the same or get smaller as you go deeper into the tree.
rsP Resampled p-value. This is a p-value calculated by a statistical resampling methodology. This p-value only appears if the user has calculated it using the Resample node menu item.
RsbP Bonferroni-adjusted resampled p-value. This is rsP multiplied by the same Bonferroni correction as is multiplied by aP to get bP.
Nn This is the node identifier. The root node is denoted as “N”. If there were 3 children of the root, they would be denoted N1, N2, and N3. Children of N2 would be denoted N21, N22, and so forth.
I At the bottom right of a node is a little “I”, which if clicked will pop up a text editor to annotate the node. If the tree is saved to disk, the annotations are recovered when the tree is next loaded. Nodes that have annotations will display the “I” in pink, otherwise in black.
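Two of the table entries, the Bonferroni correction behind bP and the node identifier scheme, can be made concrete with a short sketch. The function names are hypothetical, and capping the corrected p-value at 1 is an assumption (the table states only the multiplication).

```python
def bonferroni_adjust(adjusted_p, n_splittable_predictors):
    """bP = aP times the number of predictors that could actually be split.
    The cap at 1.0 is an assumption made for this sketch."""
    return min(1.0, adjusted_p * n_splittable_predictors)


def child_id(parent_id, child_number):
    """Node identifiers append the child's position to the parent's
    identifier, starting from the root "N"."""
    return f"{parent_id}{child_number}"


# 20 candidate predictors at the root, 15 still splittable deeper down.
bp_root = bonferroni_adjust(0.01, 20)
bp_deeper = bonferroni_adjust(0.01, 15)

second_child = child_id("N", 2)              # second child of the root
grandchild = child_id(second_child, 1)       # first child of that node
```

Because predictors that have become constant within a node drop out of the count, the same aP gives a smaller bP deeper in the tree than it would at the root.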

