Normally Distributed Response Categorical Predictor
This section typically applies when analyzing a continuous response in the recursive partitioning tree with either genetic or categorical predictors. It also applies to an allele test on a continuous variable, except that the number of observations is doubled (one observation for each allele).
The same dynamic programming method is used to find the optimal k-way split as for the Continuous/Ordinal predictor. Before segmenting, however, the predictor categories are ordered from lowest to highest by their mean response. Suppose a predictor can take on the values {black, blue, pink, green}, and the mean responses of the observations in those classes satisfy {pink ≤ green ≤ black ≤ blue}. A split might put {pink, green} together and {black, blue} together. However, you would never see pink grouped with black unless green were also in that group, because green lies between them in the ordering. Therefore it suffices to order the categories from lowest to highest mean and then segment them as if they were ordered numerically on a number line.
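The ordering-then-segmenting idea can be sketched as follows. This is a minimal illustration, not the product's implementation: it assumes the dynamic program minimizes the total within-segment sum of squared deviations, and the function names `order_categories_by_mean` and `best_k_way_split` are hypothetical.

```python
from collections import defaultdict


def order_categories_by_mean(categories, responses):
    """Return the distinct categories sorted by their mean response."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, responses):
        sums[c] += y
        counts[c] += 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])


def best_k_way_split(categories, responses, k):
    """Split the mean-ordered categories into k contiguous groups.

    Assumption: the objective is the least within-group sum of squares.
    Returns (groups, within_ss).
    """
    order = order_categories_by_mean(categories, responses)
    D = len(order)
    if not 1 <= k <= D:
        raise ValueError("k must be between 1 and the number of categories")
    index = {c: i for i, c in enumerate(order)}

    # Prefix totals over the ordered categories: count, sum, sum of squares.
    n = [0] * (D + 1)
    s = [0.0] * (D + 1)
    q = [0.0] * (D + 1)
    for c, y in zip(categories, responses):
        i = index[c] + 1
        n[i] += 1
        s[i] += y
        q[i] += y * y
    for i in range(1, D + 1):
        n[i] += n[i - 1]
        s[i] += s[i - 1]
        q[i] += q[i - 1]

    def cost(i, j):
        """Sum of squared deviations for ordered categories i..j-1."""
        cnt = n[j] - n[i]
        tot = s[j] - s[i]
        return (q[j] - q[i]) - tot * tot / cnt

    inf = float("inf")
    # dp[m][j]: best cost of splitting the first j categories into m groups.
    dp = [[inf] * (D + 1) for _ in range(k + 1)]
    cut = [[0] * (D + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, D + 1):
            for i in range(m - 1, j):
                candidate = dp[m - 1][i] + cost(i, j)
                if candidate < dp[m][j]:
                    dp[m][j] = candidate
                    cut[m][j] = i

    # Walk the stored cut points backwards to recover the groups.
    groups = []
    j = D
    for m in range(k, 0, -1):
        i = cut[m][j]
        groups.append(order[i:j])
        j = i
    groups.reverse()
    return groups, dp[k][D]
```

With made-up responses whose means follow pink ≤ green ≤ black ≤ blue, a 2-way split groups {pink, green} and {black, blue}; pink and black can only share a group if green is in it too.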
The raw p-value is calculated exactly the same as for the continuous predictor as described in Continuous Ordinal Predictors (26.3). As with the continuous predictor, a multiplicity-adjusted F-test is used to derive the adjusted split significance, “aP”. However, the multiplicity adjustment is different from the continuous case because of the sorting.
Let there be $n$ observations split into $k$ subgroups with $D$ unique values of the categorical predictor. Let $F_0$ be the sum of squared deviations from the mean over all responses. Let $F_k$ be the sum over all segments of the squared deviations from the mean responses of each respective segment.
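As a small illustration of these two quantities, the sketch below computes the total sum of squared deviations and the within-segment sum for a given segmentation. The helper names `sum_sq_dev` and `f0_and_fk` are hypothetical, and the data in the usage note are made up.

```python
def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def f0_and_fk(segments):
    """Return (F0, Fk) for a list of segments, each a list of responses.

    F0 uses the mean of all responses pooled together;
    Fk uses each segment's own mean and sums over the segments.
    """
    all_values = [v for seg in segments for v in seg]
    f0 = sum_sq_dev(all_values)
    fk = sum(sum_sq_dev(seg) for seg in segments)
    return f0, fk
```

For example, `f0_and_fk([[1, 2, 2, 3], [10, 11, 12, 13]])` pools eight responses for the first quantity and treats the two lists as separate segments for the second.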
If $D = k$, then $aP = p$; otherwise:
let $X = \log_{100} D$
let $Z = (\log D - \log k) / \log 100$
let $a = 2.9002 - 9.5043\,X + 21.935\,X^2 - 0.46624\,k + 0.00815\,k^2 + 0.0053\,k\,X$
let $b = \exp\left(-Z\,(0.1886 + 0.3621\,\log(k - 1.61))\right)$
$aP = \min\left(\exp(a + b \log p),\ 1\right)$
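The adjustment can be written out as a short function. This is a sketch under stated assumptions: `log` is taken to be the natural logarithm (which pairs with `exp`), the first term of `a` is read as the base-100 logarithm of D, and the function name `adjusted_p` and its handling of p = 0 are illustrative choices, not part of the documented method.

```python
import math


def adjusted_p(p, D, k):
    """Multiplicity-adjusted p-value aP for a k-way split over D categories.

    Assumes k >= 2 and natural logarithms throughout.
    """
    if D == k:
        # Every category is its own segment, so no adjustment is applied.
        return p
    if p <= 0.0:
        # Assumption: treat p = 0 as the limiting case; log(0) is undefined.
        return 0.0
    X = math.log(D) / math.log(100)  # log base 100 of D
    Z = (math.log(D) - math.log(k)) / math.log(100)
    a = (2.9002 - 9.5043 * X + 21.935 * X ** 2
         - 0.46624 * k + 0.00815 * k ** 2 + 0.0053 * k * X)
    b = math.exp(-Z * (0.1886 + 0.3621 * math.log(k - 1.61)))
    return min(math.exp(a + b * math.log(p)), 1.0)
```

Calling `adjusted_p(0.01, 4, 4)` returns the raw p-value unchanged because D equals k, while a call such as `adjusted_p(1e-6, 4, 2)` applies the adjustment and caps the result at 1.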