Linear Regression From a Tree Node

The (univariate) response, y, is fit to the given predictor variable, x, using linear regression, and a residual node is created beneath the node in the tree. We may represent the response as $y = b_1 x + b_0 + \varepsilon$, where the model itself is the line $b_1 x + b_0$ and the error term, $\varepsilon$, is the difference, or residual, between the response and the model of the response. If further splits and/or regressions are performed on this node, the $\varepsilon$ term effectively becomes the “dependent variable” for those splits and/or regressions.
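The fit and the residuals handed to the residual node can be sketched as follows. This is a minimal illustration that assumes the regression is ordinary least squares. The function names `fit_line` and `residuals` are hypothetical and are not taken from any particular implementation.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model y = b1*x + b0 + error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and the cross-deviation sum of x and y.
    ss = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / ss              # slope (requires at least two distinct x values)
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Response minus fitted line; these become the new 'dependent variable'."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
eps = residuals(xs, ys, b0, b1)
```

Any later split or regression beneath this node would then work on `eps` rather than on `ys`.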

For missing values of the predictor, the prediction becomes the mean of the parent node.
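A small sketch of this fallback is given below. It assumes that "the mean of the parent node" refers to the mean response in the parent node and that a missing predictor is represented as `None` or NaN. Both are assumptions made for illustration, and `predict` and `parent_mean` are hypothetical names.

```python
import math


def predict(x, b0, b1, parent_mean):
    """Prediction from the fitted line, falling back to the parent node's mean."""
    # Assumption: a missing predictor value arrives as None or as a float NaN.
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return parent_mean
    return b0 + b1 * x
```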

The p-value is calculated using a likelihood ratio test. We calculate the sum of squared residuals as $SSR = \sum_{i=1}^{n} [y_i - (b_0 + x_i b_1)]^2$. Let the mean of the $x_i$'s be $\bar{x}$, and let the sum of squared deviations from the mean be $SS = \sum_{i=1}^{n} [x_i - \bar{x}]^2$. Then

$$T = \frac{b_1}{\sqrt{\dfrac{SSR/(n-2)}{SS}}}$$

and the p-value is given by the two-sided Student’s t distribution with n-2 degrees of freedom:

$$p = 2\,\mathrm{StudentT}(|T|,\, n - 2).$$
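The calculation can be sketched in Python using only the standard library. The sketch assumes that StudentT(|T|, n - 2) denotes the upper-tail probability of a Student's t variable with n - 2 degrees of freedom, which is what the factor of 2 suggests. The names `t_upper_tail` and `slope_p_value` are hypothetical, and the tail probability is approximated numerically with Simpson's rule instead of a library routine. If SciPy is available, `scipy.stats.t.sf(abs(T), n - 2)` should give the same tail probability.

```python
import math


def t_upper_tail(t, df, steps=2000):
    """Approximate P(X > t) for t >= 0, X ~ Student's t with integer df >= 1."""
    # Substituting u = sqrt(df) * tan(theta) turns the tail integral of the
    # t density into const * integral of cos(theta)**(df - 1) over [lo, pi/2].
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(math.pi))
    lo = math.atan(t / math.sqrt(df))
    hi = math.pi / 2
    h = (hi - lo) / steps  # steps must be even for Simpson's rule

    def f(theta):
        # Clamp guards against a tiny negative cosine from rounding at pi/2.
        return max(math.cos(theta), 0.0) ** (df - 1)

    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return math.exp(log_const) * total * h / 3


def slope_p_value(xs, ys, b0, b1):
    """Return (T, p) for the slope b1, following the formulas in the text."""
    n = len(xs)  # needs n > 2 and a fit that is not exact (SSR > 0)
    x_bar = sum(xs) / n
    ssr = sum((y - (b0 + x * b1)) ** 2 for x, y in zip(xs, ys))
    ss = sum((x - x_bar) ** 2 for x in xs)
    t_stat = b1 / math.sqrt(ssr / (n - 2) / ss)
    p = 2 * t_upper_tail(abs(t_stat), n - 2)
    return t_stat, p


# Example: the least-squares line for this data is y = 0.6*x + 2.2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
t_stat, p = slope_p_value(xs, ys, b0=2.2, b1=0.6)
# Here SSR = 2.4 and SS = 10, giving roughly T = 2.12 and p = 0.124.
```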