7.2 Setting Options for Tree Analysis
The menu choice Tree->Options opens a two-tabbed dialog that lets you control how calculations are performed and which values are displayed in the individual nodes.
7.2.1 The Tree Tab
The Tree tab lets you set parameters that affect how values are calculated. These parameters control how nodes are split, how calculations are affected by missing values, and so on.
7.2.1.1 Minimum Elements per Child:
This feature counteracts the tendency of recursive partitioning to split off small groups that may be outliers. When it is set greater than one, splits are computed so that every child node has at least that minimum number of observations.
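The effect of this constraint can be sketched for the simplest case, a binary split on one continuous predictor. The function name `candidate_cut_points` and its parameters are hypothetical and are not part of the product; the sketch only shows which cut points survive the minimum-size rule, not how the software scores them.

```python
def candidate_cut_points(values, min_per_child=1):
    """Return the cut points of a binary split that leave at least
    min_per_child observations in each child (illustrative only)."""
    min_per_child = max(1, min_per_child)  # a child can never be empty
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    # A cut between positions i-1 and i puts i observations in the
    # left child and n - i observations in the right child.
    for i in range(min_per_child, n - min_per_child + 1):
        if ordered[i - 1] != ordered[i]:  # tied values cannot be separated
            cuts.append((ordered[i - 1] + ordered[i]) / 2)
    return cuts


data = [1, 2, 3, 4, 5, 100]
print(candidate_cut_points(data, min_per_child=1))
print(candidate_cut_points(data, min_per_child=2))
```

With a minimum of 1, the cut that isolates the single extreme value 100 is a candidate; with a minimum of 2, that cut is no longer considered.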
7.2.1.2 Segmenting Algorithm:
This can be either an exact O(n^2) algorithm or an approximate O(n^1.5) algorithm. For multi-way splits, the segmenting computation can be time-consuming for large sample sizes. The approximate algorithm may occasionally give a slightly sub-optimal split, but because this is rare in practice, the approximate algorithm is the default.
7.2.1.3 Max Segments:
This is the maximum cardinality of a multi-way split, with a default of 3. While there is no limit to the cardinality you may specify for a split, the default gives the best prediction performance across a broad class of datasets. High-cardinality models are not necessarily better, because every extra segment leaves a smaller sample for successive splits. In addition, overbuilding trees with a higher cardinality can produce spurious split segments, for instance a fourth or fifth segment where three would have been sufficient.
If a path length is the splitter, the three segments can be path lengths that are too low, path lengths that are optimum, and path lengths that are too high. More segments would not realistically be needed to isolate this optimum path length.
7.2.1.4 Parallel Threads:
This setting specifies the number of concurrently running threads used to calculate splits. On a multiprocessor machine, this can markedly improve performance on very large data sets.
7.2.1.5 Resampling Iterations:
This defines the number of iterations used to estimate p-values using the resampling approach. The smallest bound you can place on a p-value is 1 divided by the number of resampling iterations; for example, 1,000 iterations cannot resolve a p-value below 0.001.
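The relationship between the iteration count and the smallest reportable p-value can be illustrated with a generic permutation test. This is a minimal sketch under stated assumptions: the statistic (absolute difference in means), the function name `resampled_p_value`, and the floor at 1/iterations are illustrative choices, not the resampling procedure the software actually uses.

```python
import random


def resampled_p_value(group_a, group_b, iterations=1000, seed=0):
    """Permutation p-value for a difference in means (illustrative only)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)  # randomly reassign observations to groups
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Assumption: report at least 1/iterations, since zero hits only
    # shows that the p-value lies below the resolution of the resampling.
    return max(hits, 1) / iterations


low = list(range(1, 9))
high = list(range(101, 109))
p = resampled_p_value(low, high, iterations=100)
print(p, ">=", 1 / 100)
```

However well separated the two groups are, the reported value cannot fall below 1/iterations, so more iterations are needed to resolve smaller p-values.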
7.2.1.6 P Value Threshold:
The radio buttons RawP, AdjustedP, and Bonferroni adjustedP determine which type of p-value is compared against the threshold p-value. Any split whose p-value is greater than the threshold is disallowed, and the manual split window uses the same comparison to decide when to call a split not significant.
In multiway splitting, the Multiway split pairwise P value threshold is also used as the significance level for pairwise comparisons between adjacent nodes. If any pair of adjacent nodes in a multiway split is not significantly different according to this threshold, the cardinality of the split is reduced.
By default, the Bonferroni adjusted p-value is used as the criterion for stopping splitting, because it has the lowest chance of over-fitting. However, when doing variable selection, it might be advisable to use the raw or adjusted p-value as the stopping criterion, because the Bonferroni adjusted p-value may be overly conservative for this purpose when there are large numbers of predictors.
NOTE: For best prediction performance on multiple trees with the Bonferroni p-value threshold, larger p-value thresholds are better, even as high as 0.99. However, if you make the threshold this lax, set Max Segments to 2 or 3. Using RawP or AdjustedP as the splitting criterion (both are now available) allows trees to be overbuilt with a less lax p-value threshold than the Bonferroni threshold would require.
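The stopping rule in this section can be sketched as a single comparison. The function `split_is_allowed`, its argument names, and the example threshold of 0.05 are hypothetical. The Bonferroni value follows the bP definition in the node values table (the adjusted p-value multiplied by the number of predictors searched); capping it at 1 is an added assumption.

```python
def split_is_allowed(raw_p, adjusted_p, n_predictors_searched,
                     criterion="bonferroni", threshold=0.05):
    """Return True if the split passes the selected p-value threshold.

    criterion is one of "raw", "adjusted", or "bonferroni"; these names
    stand in for the RawP, AdjustedP, and Bonferroni adjustedP buttons.
    """
    # bP = aP times the number of predictors searched (capped at 1,
    # which is an assumption of this sketch).
    bonferroni_p = min(1.0, adjusted_p * n_predictors_searched)
    chosen = {
        "raw": raw_p,
        "adjusted": adjusted_p,
        "bonferroni": bonferroni_p,
    }[criterion]
    # A split is disallowed when its p-value is greater than the threshold.
    return chosen <= threshold


# The same split judged under each criterion, with 50 predictors searched.
for name in ("raw", "adjusted", "bonferroni"):
    print(name, split_is_allowed(0.001, 0.004, 50, criterion=name))
```

With many predictors, the same split can pass under the raw or adjusted p-value and fail under the Bonferroni adjusted p-value, which is why the Bonferroni criterion can be overly conservative for variable selection.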
7.2.2 The Node View Tab
This tab allows you to turn ON or OFF the statistics and p-values displayed in the tree nodes. To turn ON the display of a value, select its check box.
7.2.2.1 What the Node Values Mean
The following table explains the meaning of the node symbols:
| Symbol | Meaning |
|---|---|
| BP_I | Response Variable Name (only shows in root node) |
| n | Node sample size |
| u | Node mean. (For multivariate display, the means of each of the dependent variables will be listed as u1, u2, and so forth.) |
| s | Node standard deviation. (For multivariate display, the standard deviations of each of the dependent variables will be listed as s1, s2, and so forth.) |
| se | Node standard error. (The multivariate display will list standard error for each dependent variable.) |
| mse | Node mean squared error. (The multivariate display will list mse for each dependent variable.) |
| P | The p-value calculated as a T-test or F-test or Chi-squared test, as appropriate. This field is only used in “parent” or non-terminal nodes. |
| aP | The p-value multiplicity adjusted for the number of possible cut-points searched through in the case of continuous, ordinal, or categorical predictors. For binary predictors and regression fits, aP=P as there is no multiple testing. This p-value is not adjusted for the multiplicity of independent variables searched through to find the split. |
| bP | The Bonferroni-adjusted p-value of the split. It is the adjusted p-value (aP) multiplied by a Bonferroni correction, which is equal to the number of independent variables that were searched through to find the split. Note that if a given independent variable cannot serve as a splitter because all instances are identical, then that variable will not contribute to the Bonferroni correction. For this reason, the correction factor will either stay the same or get smaller as you go deeper into the tree. |
| rsP | Resampled p-value. This is a p-value calculated by a statistical resampling methodology. This p-value only appears if the user has calculated it using the Resample node menu item. |
| RsbP | Bonferroni-adjusted resampled p-value. This is rsP multiplied by the same Bonferroni correction that is applied to aP to obtain bP. |
| Nn | This is the node identifier. The root node is denoted as “N”. If there were 3 children of the root, they would be denoted N1, N2, and N3. Children of N2 would be denoted N21, N22, and so forth. |
| I | At the bottom right of a node is a small “I”. Clicking it opens a text editor for annotating the node. If the tree is saved to disk, the annotations are restored when the tree is next loaded. Nodes that have annotations display the “I” in pink; all other nodes display it in black. |
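
The node identifier scheme in the Nn row can be sketched as appending each child's position to its parent's identifier. The helper names `child_ids` and `depth_of` are hypothetical, and the depth calculation assumes fewer than ten children per node so that each level adds exactly one digit.

```python
def child_ids(parent_id, n_children):
    """Build child identifiers by appending 1, 2, ... to the parent's id."""
    return [f"{parent_id}{k}" for k in range(1, n_children + 1)]


def depth_of(node_id):
    """Depth below the root "N", assuming one digit per level."""
    return len(node_id) - 1


print(child_ids("N", 3))   # children of the root
print(child_ids("N2", 2))  # children of the second child of the root
print(depth_of("N21"))
```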