Plotting Haplotype Regression
17.1 Plotting Haplotype Regression
For the science behind haplotype trend regression, please refer to Sections 26.6 and 26.7 from Chapter 26 Formulas and Theories.
17.1.1 Haplotype Regression
To run a Haplotype Trend Regression (HTR), you must first enable the option. Once a tree has been created from the spreadsheet data, select Tree->Options from the tree view menu. An Options window opens (Fig. 17.1). On the Tree tab, select the check box next to Haplotype Trend Regression. This enables the options relating to HTR, which let you choose the moving-window size (and other moving-window options if you are using marker mapping) and how the haplotypes will be estimated.
If you wish HelixTree to infer missing genotype values from the existing haplotypes, check the Use missing values as predictors (impute missing data for HTR) box.
To prevent other split types, such as Genotype splits and Non-genetic splits, from being considered, uncheck their boxes.
Please see (7.2) Setting Options for Tree Analysis for more details about setting tree options.
NOTE: To change the default tree options for the whole HelixTree project or globally (for all HelixTree projects), see (3.5.3.1) Tree Options.
17.1.2 Obtaining a Haplotype Regression
Next, a haplotype trend regression “split” must be obtained. From a tree view, left-click or right-click a non-leaf tree node to open a context menu. Click Manual Split to open a manual split display such as the one shown in Fig. 17.2.
Select the desired haplotype trend regression “split” from this display. (Fig. 17.2 shows an HTR “split” being selected from a manual split window.)
NOTE: “Split Node” or other methods may also be used to induce an HTR “split”. However, the manual split window has the advantage of providing a plot of p-values by moving-window position. (Click Plot P Values by Var # to get this.)
17.1.3 Viewing the Details of the Regression
Once the HTR “split” has been chosen, you may look at either the regression statistics or a residual spreadsheet. To do this, left-click or right-click the original tree node in the same tree view to open a context menu. Click Visualize Genetics->Haplotype Regression->Regression Statistics to see the regression statistics, or click Visualize Genetics->Haplotype Regression->Residual Spreadsheet to see the residual spreadsheet.
17.1.4 Residual Spreadsheet
This spreadsheet will contain the actual, predicted, and residual values for each sample, as well as the estimated haplotype values for each haplotype. The residual value of a sample is defined as the difference between the sample’s actual value and its predicted value from the regression.
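As a minimal sketch of this definition, the following Python snippet builds residual rows from hypothetical actual and predicted values. The sample names, the values, and the helper `residual_rows` are illustrative assumptions and are not part of HelixTree.

```python
def residual_rows(sample_ids, actual, predicted):
    """Return one (sample, actual, predicted, residual) row per sample."""
    return [
        (sid, a, p, a - p)  # residual = actual value minus predicted value
        for sid, a, p in zip(sample_ids, actual, predicted)
    ]


# Hypothetical data for three samples (not taken from any real project).
rows = residual_rows(["S1", "S2", "S3"], [2.0, 3.5, 1.0], [1.5, 3.5, 2.0])
for row in rows:
    print(row)
```

A positive residual means the regression under-predicted the sample, and a negative residual means it over-predicted.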
NOTE: Strictly speaking, residuals are less meaningful when the dependent variable is binary, because their distribution separates into two parts. However, HelixTree shows this spreadsheet anyway so that you can see the individual haplotype frequency estimates. It also gives a crude gauge of how well the regression is predicting the dependent variable.
With linear regression, you might choose to further regress upon a residual, and this would be reasonably meaningful. The above comment illustrates why HelixTree also has a more sophisticated (optional) regression feature (24). Among other things, this feature allows you to “correct for” covariates, even if both the dependent and the covariates are binary.
17.1.5 Regression Statistics
A text viewer will come up to show the haplotype trend regression statistics. Portions of text in this viewer may easily be copied for pasting elsewhere.
The following statistics are shown in this viewer:
17.1.5.1 Overall Statistics
If the dependent variable is continuous, the following overall statistics are shown:
- The name of the response variable.
- The multiple correlation coefficient R. (Square root of R².)
- The coefficient of determination R².
The coefficient of determination R² is computed as
$$R^2 = 1 - \frac{\mathrm{sse}}{\mathrm{sst}}$$
where sse is the sum square of errors (sum of squares of predicted minus actual values) and sst is the sum square of totals (sum of squares of the dependent average minus the actual values). This statistic is sometimes thought of as the amount of variation of the dependent variable “explained” by the independent variables.
- The adjusted-R² statistic. Adjusted R² is meant to compensate for many regressors each “explaining” small portions of the variation by chance.
This statistic is computed as
$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{N - 1}{N - k - 1}$$
where N is the sample size and k is the number of regressors. (In some cases, adjusted R² may be negative.)
- The sample size. The sample will only be the subset covered by the originating tree node.
- The standard error of the estimate. This is computed as
$$s_e = \sqrt{\frac{\mathrm{sse}}{n - \mathrm{reg\_df} - 1}}$$
where s_e is the standard error (of the estimate), sse is the sum square of errors (sum of squares of predicted minus actual values), n is the sample size, and reg_df is one less than the number of haplotypes.
- The standard deviation of the response.
- The F-statistic.
- The p-value from the regression.
- The regression degrees of freedom.
- The residual degrees of freedom.
- The total degrees of freedom.
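The continuous-dependent statistics above can be sketched in a few lines of Python. This is an illustration only, assuming the usual textbook forms R² = 1 - sse/sst, adjusted R² = 1 - (1 - R²)(N - 1)/(N - k - 1), and a standard error of sqrt(sse / (n - reg_df - 1)). The function name `overall_statistics` and the data are hypothetical, and HelixTree's own implementation may differ in detail.

```python
import math


def overall_statistics(actual, predicted, num_haplotypes):
    """Overall fit statistics for a continuous dependent variable (sketch)."""
    n = len(actual)
    reg_df = num_haplotypes - 1      # one haplotype is left out of the model
    resid_df = n - reg_df - 1
    mean_y = sum(actual) / n
    # sse: sum of squares of predicted minus actual values
    sse = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # sst: sum of squares of the dependent average minus the actual values
    sst = sum((mean_y - a) ** 2 for a in actual)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / resid_df
    return {
        "r": math.sqrt(max(r2, 0.0)),
        "r2": r2,
        "adj_r2": adj_r2,
        "std_error": math.sqrt(sse / resid_df),
        "reg_df": reg_df,
        "resid_df": resid_df,
        "total_df": n - 1,
    }


# Hypothetical actual and predicted values for five samples, two haplotypes.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.1, 3.8, 5.0]
stats = overall_statistics(actual, predicted, num_haplotypes=2)
for name, value in stats.items():
    print(name, value)
```

Note that the regression and residual degrees of freedom add up to the total degrees of freedom, n - 1.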
If the dependent variable is binary, the following overall statistics are shown:
- The name of the response variable.
- The regression likelihood.
- The null model likelihood.
- The sample size.
- The regression chi-square statistic.
- The regression p-value.
- The regression degrees of freedom.
- The residual degrees of freedom.
- The total degrees of freedom.
17.1.5.2 Regressor Statistics
After the y-intercept for the regression is displayed, the following statistics are displayed for each regressor:
- The regressor, which will be a haplotype.
- The regression coefficient for this regressor.
- (Continuous dependent only:) The standard error for this regressor. To compute this, a regression is taken with all the regressors but this one as independents and with this regressor as a substitute dependent variable. If ssr is the sum of squares of this regressor’s actual values minus this regressor’s average, R_r² is the R² value obtained from this regression-against-the-regressor, and the standard error of the estimate is s_e, then the standard error of the regressor s_r will be
$$s_r = \frac{s_e}{\sqrt{\mathrm{ssr}\,(1 - R_r^2)}}$$
- (Continuous dependent only:) The value of the t-statistic for this regressor. This is computed as
$$t = \frac{\beta}{s_r}$$
where β is the regressor’s regression coefficient and s_r is the standard error of the regressor.
- Pr(> |t|). This is the p-value from comparing a full model with a reduced model, where the full model is the actual regression model and the reduced model is the same model without this regressor. It thus shows how much difference this particular regressor makes in the regression. Pr(> |t|) refers to the probability that the difference made by adding this regressor is accounted for by chance, and thus that this case could be thought of as being in one of the “tails” of the t-distribution.
- (Binary dependent only:) The odds ratio for this regression coefficient. This expresses the ratio of the odds for a “case” if this regressor (haplotype probability) equals one to the odds for a “case” if this regressor (haplotype probability) equals zero. Mathematically, this is expressible as e^β, where β is this regression coefficient.
NOTE: Remember that while the total probability possible for any haplotype is 1, “1” means two copies of the haplotype are homozygously present, “.5” means effectively exactly one copy is present, and “0” means no copies are present.
- Univariate Fit. This is the p-value of simply taking a regression with this haplotype regressor by itself against the dependent variable.
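The per-regressor quantities described above can be illustrated with a short Python sketch. It assumes the standard forms s_r = s_e / sqrt(ssr (1 - R_r²)) and t = β / s_r for a continuous dependent, and a logistic model (log-odds linear in the haplotype probability) for a binary dependent. The function names and the `haplotype_value` parameter are hypothetical and do not correspond to any HelixTree interface.

```python
import math


def regressor_standard_error(s_e, ssr, r2_aux):
    """Standard error of one regressor.

    s_e    -- standard error of the estimate from the full regression
    ssr    -- sum of squares of the regressor's values about its own mean
    r2_aux -- R^2 from regressing this regressor on all other regressors
    """
    return s_e / math.sqrt(ssr * (1.0 - r2_aux))


def t_statistic(beta, s_r):
    """t-statistic: regression coefficient divided by its standard error."""
    return beta / s_r


def odds_ratio(beta, haplotype_value=1.0):
    """Odds ratio relative to a haplotype value of 0, under a logistic model.

    haplotype_value=1.0 corresponds to two copies (e^beta),
    haplotype_value=0.5 corresponds to one copy (e^(beta / 2)).
    """
    return math.exp(beta * haplotype_value)


# Hypothetical inputs chosen so the arithmetic is easy to follow.
s_r = regressor_standard_error(s_e=2.0, ssr=16.0, r2_aux=0.75)
t = t_statistic(beta=3.0, s_r=s_r)
two_copies = odds_ratio(math.log(4.0))
one_copy = odds_ratio(math.log(4.0), haplotype_value=0.5)
print(s_r, t, two_copies, one_copy)
```

Under the logistic assumption, an odds ratio of 4 for two copies of a haplotype corresponds to an odds ratio of 2 for a single copy, because a haplotype value of 0.5 multiplies the log-odds by half of β.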
17.1.5.3 Left-Out Regressors
The haplotype (potential regressor) which has been left out is listed here. This will be one final haplotype of either a “normal” frequency or a “rare” frequency, or one final haplotype category consisting of the frequencies of the “rare haplotypes” aggregated together. Leaving out this final haplotype avoids the multicollinearity problem that would otherwise occur between the haplotypic regressors.
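To see why one haplotype must be left out, consider the following Python sketch. It assumes that each sample's haplotype values sum to 1 (two copies at 0.5 each), so that any one column is fully determined by the others. The haplotype labels, the values, and the helper names are hypothetical.

```python
# Hypothetical per-sample haplotype values (1 = two copies, 0.5 = one copy).
samples = [
    {"AC": 1.0, "AT": 0.0, "GC": 0.0},
    {"AC": 0.5, "AT": 0.5, "GC": 0.0},
    {"AC": 0.0, "AT": 0.5, "GC": 0.5},
    {"AC": 0.5, "AT": 0.0, "GC": 0.5},
]


def drop_haplotype(rows, left_out):
    """Return the regressor rows with the left-out haplotype removed."""
    return [{h: v for h, v in row.items() if h != left_out} for row in rows]


def reconstruct_left_out(reduced_row):
    """Recover the left-out value from the remaining haplotype values."""
    return 1.0 - sum(reduced_row.values())


reduced = drop_haplotype(samples, left_out="GC")
recovered = [reconstruct_left_out(row) for row in reduced]
print(recovered)
```

Because the left-out column can be recovered exactly from the remaining columns and the intercept, including it would add no information and would make the regressors linearly dependent.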
17.1.5.4 Table of Haplotypes
A table of the haplotypes (used in the regression or not) and their frequencies is shown.
17.1.5.5 Parameters
The parameters used for the regression are shown.