17.5 Linear Regression with Continuous Response

As an alternative to recursive partitioning on a linear response, ChemTree provides linear regression of a continuous response on one or more binary, continuous, or categorical variables and/or first-order interactions between these variables.

If a variable is categorical, dummy variables will be used, one for each category. Each dummy variable will take on the value “1” if the observation takes on its corresponding category, and “0” otherwise. When a regression is done, the last category’s variable is normally dropped to avoid a rank-deficient matrix in the regression.
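As an illustration of this coding, the following minimal Python sketch builds 0/1 dummy columns for one categorical variable and drops the last category's column. The function name `dummy_columns` and the alphabetical ordering of categories are assumptions made for the example; the order in which ChemTree arranges categories is not stated here.

```python
def dummy_columns(values, drop_last=True):
    """Return (categories, columns) with one 0/1 column per kept category.

    Assumption: categories are ordered by sorting their labels.
    """
    categories = sorted(set(values))
    if drop_last:
        # Dropping one category keeps the regression matrix (which also
        # carries a constant term) from being rank-deficient.
        categories = categories[:-1]
    columns = {c: [1 if v == c else 0 for v in values] for c in categories}
    return categories, columns


# Illustrative values only.
cats, cols = dummy_columns(["red", "green", "blue", "green"])
print(cats)
print(cols)
```

An observation in the dropped category has "0" in every remaining dummy column, so that category acts as the baseline.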

A first-order interaction term is a “new” variable created from the product of two spreadsheet variables. In the case of one variable being categorical, that variable’s dummy variables for its categories will each be multiplied by the other variable to create a first-order interaction term. In the case of both being categorical, their dummy variables will be multiplied by each other.
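The products described above can be sketched in a few lines of Python. The helper name `interaction_terms`, the column labels, and the sample values are hypothetical and serve only to show the element-wise multiplication.

```python
def interaction_terms(dummies, other, other_name="x"):
    """Multiply each dummy column element-wise by another column."""
    return {
        f"{name}*{other_name}": [d * v for d, v in zip(col, other)]
        for name, col in dummies.items()
    }


# Dummy columns of a categorical variable (illustrative values only).
dummies = {"blue": [0, 0, 1, 0], "green": [0, 1, 0, 1]}
dose = [1.5, 2.0, 0.5, 3.0]          # a continuous variable

terms = interaction_terms(dummies, dose, other_name="dose")
print(terms)
```

When both variables are categorical, the same helper can be called once per dummy column of the second variable, so that every pair of dummy columns is multiplied.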

17.5.1 Methodology

The p-value for linear regression on a continuous response with n observations is obtained as follows:

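\[
\begin{aligned}
\mathrm{redss} &= \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \quad \text{where } y_i \text{ is the } i\text{-th response},\\
\mathrm{tcss} &= \sum_{i=1}^{n} y_i^2 - \mathrm{redss},\\
\mathrm{mss} &= \sum_{i=1}^{n} y_i\,\beta^{T} x_i, \quad \text{where } \beta \text{ is the vector of regression coefficients and } x_i \text{ is the vector of regressor values for the } i\text{-th observation},\\
\mathrm{regss} &= \max(\mathrm{mss} - \mathrm{redss},\, 0),\\
\mathrm{errss} &= \mathrm{tcss} - \mathrm{regss},\\
\mathrm{reg\_df} &= k, \quad \text{where } k \text{ is the number of regressors},\\
\mathrm{err\_df} &= n - \mathrm{reg\_df} - 1, \quad \text{and}\\
p &= \mathrm{Ftest}\!\left(\frac{\mathrm{regss} \times \mathrm{err\_df}}{\mathrm{errss} \times \mathrm{reg\_df}},\ \mathrm{reg\_df},\ \mathrm{err\_df}\right).
\end{aligned}
\]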

(Except for those circumstances where mss is less than redss, this is equivalent to the F-test for linear regression

\[
p = \mathrm{Ftest}\!\left(\frac{R^2 \times \mathrm{err\_df}}{(1 - R^2) \times \mathrm{reg\_df}},\ \mathrm{reg\_df},\ \mathrm{err\_df}\right),
\]

where $R^2$ is the coefficient of determination for the linear regression.)
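The calculation in this subsection can be traced with a small, self-contained Python sketch for the one-regressor case. Several things here are assumptions rather than ChemTree's implementation: Ftest is taken to mean the upper-tail probability of the F distribution, that probability is computed with a standard continued-fraction evaluation of the regularized incomplete beta function, and all function names and data values are illustrative. In practice a statistics library routine would normally replace `f_upper_tail`.

```python
import math


def _guard(v, tiny=1e-300):
    """Keep a continued-fraction term away from zero."""
    return v if abs(v) > tiny else tiny


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (Lentz's method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 / _guard(1.0 - qab * x / qap)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 / _guard(1.0 + aa * d)
        c = _guard(1.0 + aa / c)
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 / _guard(1.0 + aa * d)
        c = _guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(f, df1, df2):
    """P(F >= f) for an F(df1, df2) variate (assumed meaning of Ftest)."""
    if f <= 0.0:
        return 1.0
    return reg_inc_beta(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))


def regression_p_value(x, y):
    """Least squares with one regressor, then the sums of squares above."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    fitted = [y_bar + slope * (xi - x_bar) for xi in x]

    redss = sum(y) ** 2 / n
    tcss = sum(yi * yi for yi in y) - redss
    mss = sum(yi * fi for yi, fi in zip(y, fitted))
    regss = max(mss - redss, 0.0)
    errss = tcss - regss
    reg_df = 1                       # one regressor in this sketch
    err_df = n - reg_df - 1
    f_stat = regss * err_df / (errss * reg_df)
    r2 = regss / tcss
    return f_stat, f_upper_tail(f_stat, reg_df, err_df), r2, err_df


# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
f_stat, p, r2, err_df = regression_p_value(x, y)
f_from_r2 = r2 * err_df / ((1.0 - r2) * 1)
print(f_stat, f_from_r2, p)
```

For this data mss exceeds redss, so the statistic built from regss and errss agrees with the one built from the coefficient of determination, up to rounding.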

17.5.2 Stepwise Regression

It could be that only a few potential variables really affect the outcome. If this is suspected to be the case, then stepwise regression can be appropriate.

Starting with the null model, successive models are created, each one using one more regressor than the previous model.

To pick which regressor to use for the next model, each of the unused regressors in turn is tried out by adding it to the current model. The P-value of the trial model as a “full model” vs. the current model as a “reduced model” is found, and the model with the best (smallest) P-value found this way is used. However, if no P-value is better than the “P-value cutoff” that was specified, the stepwise method stops, and declares the current model as the end result. (Of course, the stepwise method will also stop if all possible regressors have been used up.)
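The selection loop just described might be sketched as follows. This is a simplified outline, not ChemTree's code: `p_value_for_adding` is a hypothetical callback that returns the full-model versus reduced-model p-value for adding one candidate to the current model, and treating a p-value exactly equal to the cutoff as "not better" is an assumption.

```python
def forward_stepwise(candidates, p_value_for_adding, p_cutoff):
    """Forward selection: add the best regressor until none beats the cutoff."""
    current = []                     # the null model has no regressors
    unused = list(candidates)
    while unused:                    # stop once every regressor is used
        trials = [(p_value_for_adding(current, c), c) for c in unused]
        best_p, best = min(trials, key=lambda t: t[0])
        if not best_p < p_cutoff:    # no trial model beats the cutoff
            break
        current.append(best)
        unused.remove(best)
    return current


# Toy p-values that ignore the current model; made up for illustration.
toy_p = {"a": 0.001, "b": 0.2, "c": 0.01}
chosen = forward_stepwise(["a", "b", "c"],
                          lambda current, cand: toy_p[cand],
                          p_cutoff=0.05)
print(chosen)
```

In a real run the callback would refit the trial model each time, so a candidate's p-value generally changes as regressors are added to the current model.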

To find the significance of each “full model” vs. its corresponding “reduced model”, an F-test is done. Specifically, if for n observations, $\mathrm{regss}_f$, $\mathrm{errss}_f$, $\mathrm{regss}_r$, and $\mathrm{errss}_r$ are defined for the full model and the reduced model, respectively, the same way as regss and errss are for performing a simple regression (17.5.1), then the F-test is

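\[
p = \mathrm{Ftest}\!\left(\frac{\left(\dfrac{\mathrm{regss}_f - \mathrm{regss}_r}{\mathrm{errss}_r}\right) \times \mathrm{df}_2}{1 + \left(\dfrac{\mathrm{errss}_f - \mathrm{errss}_r}{\mathrm{errss}_r}\right)},\ 1,\ \mathrm{df}_2\right),
\]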

where $\mathrm{df}_2 = n - \mathrm{numFull} - 1$ and numFull is the size of the full model.
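As a hedged sketch, the full-model versus reduced-model statistic can be computed directly from the four sums of squares. The function name and the numbers below are illustrative assumptions. The sketch also checks that the expression reduces algebraically to the more familiar form (regss_f - regss_r) × df2 / errss_f, since errss_r × (1 + (errss_f - errss_r) / errss_r) = errss_f.

```python
def stepwise_f_statistic(regss_f, errss_f, regss_r, errss_r, n, num_full):
    """F statistic for adding one regressor to the reduced model."""
    df2 = n - num_full - 1
    numerator = (regss_f - regss_r) / errss_r * df2
    denominator = 1.0 + (errss_f - errss_r) / errss_r
    return numerator / denominator, df2


# Illustrative sums of squares with tcss = 100 for both models.
f_stat, df2 = stepwise_f_statistic(
    regss_f=55.0, errss_f=45.0,      # full model (two regressors)
    regss_r=40.0, errss_r=60.0,      # reduced model (one regressor)
    n=20, num_full=2,
)
familiar = (55.0 - 40.0) * df2 / 45.0
print(f_stat, familiar, df2)
```

The p-value then follows by passing the statistic, 1, and df2 to an F-distribution tail function.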