17.9 Binomially Distributed Response, Continuous/Ordinal Predictor

The mechanism for finding the optimal split is the same as when the response is continuous. However, instead of minimizing the sum of squared deviations from the mean, the maximum likelihood equations give us a different function to minimize. If p_j is the proportion of ones in segment j and n_j is the number of observations in segment j, we segment the data into k subgroups that minimize the sum

    F_k = \sum_{j=1}^{k} \left[ -2 n_j \left( p_j \log(p_j) + (1 - p_j) \log(1 - p_j) \right) \right].
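
As an illustration only, the sum above can be computed and minimized by brute force for a two-way split. This is a minimal sketch, not the search procedure actually used: the function names (xlogx, segment_term, f_k, best_two_way_split) are hypothetical, natural logarithms are assumed, and the term p log(p) is taken to be 0 when p = 0.

```python
import math

def xlogx(p):
    """Return p * log(p), using the limiting value 0 when p is 0 (assumption)."""
    return p * math.log(p) if p > 0.0 else 0.0

def segment_term(ones, n):
    """-2 * n_j * (p_j log p_j + (1 - p_j) log(1 - p_j)) for one segment."""
    p = ones / n
    return -2.0 * n * (xlogx(p) + xlogx(1.0 - p))

def f_k(segments):
    """F_k for a list of (ones, n) pairs, one pair per segment."""
    return sum(segment_term(ones, n) for ones, n in segments)

def best_two_way_split(x, y):
    """Brute-force search for the cut on x that minimizes F_2.

    x holds predictor values and y holds 0/1 responses. Returns
    (cut_value, F_2); observations with x <= cut_value form the left segment.
    Returns None if x has only one distinct value.
    """
    pairs = sorted(zip(x, y))
    n, total_ones = len(pairs), sum(y)
    best = None
    left_ones = 0
    for i in range(n - 1):
        left_ones += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # cannot cut between tied predictor values
        f = f_k([(left_ones, i + 1), (total_ones - left_ones, n - i - 1)])
        if best is None or f < best[1]:
            best = (pairs[i][0], f)
    return best
```

A perfectly separated response gives F_2 = 0 at the separating cut, which is the smallest value the sum can take.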

It turns out that if the response is multivariate binary, we can sum this same metric for each dimension separately and add it to the total for each segment.

Again, there is limited theory on the choice of the number of segments. A heuristic analogous to the continuous case is used: a chi-square test is done between each pair of adjacent segments in a k-way split, and if the proportions in any pair are not significantly different, we drop down to the best (k-1)-way split. Once every pair of adjacent segments is significantly separated, we do an overall goodness-of-fit test for the resulting split of 2 or more segments.
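
The adjacent-segment check in this heuristic might be sketched as follows. The text does not say which chi-square test or significance level is used, so this sketch assumes a Pearson chi-square test on the 2x2 table of ones and zeros (1 degree of freedom, no continuity correction) and a hypothetical threshold alpha = 0.05. The function names are illustrative.

```python
import math

def pearson_2x2_p(ones_a, n_a, ones_b, n_b):
    """p-value of a Pearson chi-square test (1 df) comparing two proportions."""
    n = n_a + n_b
    ones = ones_a + ones_b
    zeros = n - ones
    if ones == 0 or zeros == 0:
        return 1.0  # both segments are all zeros or all ones: no difference
    zeros_a, zeros_b = n_a - ones_a, n_b - ones_b
    # Shortcut formula for the chi-square statistic of a 2x2 table.
    stat = n * (ones_a * zeros_b - ones_b * zeros_a) ** 2 / (n_a * n_b * ones * zeros)
    # Upper-tail chi-square probability for 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

def all_adjacent_separated(segments, alpha=0.05):
    """True if every pair of adjacent (ones, n) segments differs at level alpha.

    alpha is an assumed threshold, not one stated in the text.
    """
    return all(
        pearson_2x2_p(a[0], a[1], b[0], b[1]) < alpha
        for a, b in zip(segments, segments[1:])
    )
```

When all_adjacent_separated returns False for a k-way split, the heuristic would move on to the best (k-1)-way split and repeat the check.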

One can use a chi-square test to test whether the proportions in the segments differ, that is, to test the null hypothesis that they are all equal. The resulting p-value, which is labeled “p=” in a node, can be calculated as follows.

Let there be n observations split into k subgroups with D unique values of the continuous or ordinal predictor (not counting a possible missing value). To generalize to the multivariate case, let there be v separate binary dependent columns. Let s be the proportion of ones in the entire sample, and p_j be the proportion of ones in segment j. Then F_0 = -2n(s log(s) + (1 - s) log(1 - s)), and F_k is defined as above. Then X^2 = F_0 - F_k, and p = chisqr(X^2, (k - 1)v).
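
The calculation of X^2 and p can be sketched in a few lines. Several things here are assumptions rather than facts from the text: chisqr is taken to be the upper-tail chi-square probability (implemented below as the hypothetical helper chi2_sf), logarithms are natural, and in the multivariate case F_0 is summed over the v columns in the same way as F_k.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1),
    started from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y), with y = x / 2.
    """
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def deviance(ones, n):
    """-2 * n * (p log p + (1 - p) log(1 - p)), with 0 * log(0) taken as 0."""
    p = ones / n
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            total += q * math.log(q)
    return -2.0 * n * total

def split_p_value(segments):
    """Return (X2, p) for a split.

    segments is a list of (n_j, ones_per_column) pairs, where ones_per_column
    has one count of ones for each of the v binary response columns.
    """
    k = len(segments)
    v = len(segments[0][1])
    n = sum(n_j for n_j, _ in segments)
    f0 = fk = 0.0
    for c in range(v):
        f0 += deviance(sum(ones[c] for _, ones in segments), n)
        fk += sum(deviance(ones[c], n_j) for n_j, ones in segments)
    x2 = max(f0 - fk, 0.0)  # guard against tiny negative rounding error
    return x2, chi2_sf(x2, (k - 1) * v)
```

For example, split_p_value([(20, [2]), (20, [18])]) describes a two-way split of 40 observations on a single binary response, with 2 ones in the first segment and 18 in the second.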

As with the continuous response case, this test statistic does not account for the exhaustive search through all possible cut-points in finding the optimal set of segments. Again, using the simulation approach of Hawkins, we calculate a multiplicity-adjusted p-value as follows:

Let I=1 if there are missing values, otherwise I=0;

if((I=1 and D=k-1) or (I=0 and D=k)) then aP=p. Otherwise:
let X = log D
a = 1.9517 - ( 0.488 / v ) + ( 0.1598 X ) - ( 0.723 k ) + ( 0.252 k X )
b = 0.914 - ( 0.0043 k ) - ( 0.00607 k X )
aP = min( exp( a + b log(p)), 1).
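
The adjustment above translates directly into code. The sketch below assumes that log denotes the natural logarithm and that p lies in (0, 1]; the function name adjusted_p and its parameter names are hypothetical.

```python
import math

def adjusted_p(p, k, d, v=1, has_missing=False):
    """Multiplicity-adjusted p-value aP from the formulas above.

    p: unadjusted p-value, assumed to lie in (0, 1]
    k: number of segments
    d: number of unique predictor values (D), excluding missing
    v: number of binary response columns
    has_missing: True if the predictor has missing values (I = 1)
    """
    i = 1 if has_missing else 0
    # With no choice of cut-points there is nothing to adjust for.
    if (i == 1 and d == k - 1) or (i == 0 and d == k):
        return p
    x = math.log(d)  # natural log is an assumption
    a = 1.9517 - (0.488 / v) + (0.1598 * x) - (0.723 * k) + (0.252 * k * x)
    b = 0.914 - (0.0043 * k) - (0.00607 * k * x)
    return min(math.exp(a + b * math.log(p)), 1.0)
```

With p = 0.01, k = 2, D = 10 and v = 1, the adjusted value is larger than the unadjusted one, reflecting the search over the 9 possible cut-points.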