Results from Logistic Regression (Optional Module)

NOTE: A logistic regression attempt will sometimes fail. One cause is insufficient rank in the matrix, which occurs either when there are not enough observations or when some of the regressors are “collinear”, that is, linear combinations of one or more other regressors, and therefore unable to “present new data” to the regression.
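The rank condition described in the note can be checked before a regression is attempted. The following is a minimal sketch, not HelixTree's own procedure. The function names and the small example matrices are hypothetical, and each matrix is assumed to carry a leading column of ones for the intercept.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix (a list of equal-length rows), by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def has_full_column_rank(design):
    """True when no regressor column is a linear combination of the others."""
    return matrix_rank(design) == len(design[0])

# Hypothetical design matrices: intercept column followed by regressor columns.
independent = [[1, 1, 2], [1, 2, 1], [1, 3, 5], [1, 4, 0]]
# The last column is the sum of the two columns before it (collinear).
collinear = [[1, 1, 2, 3], [1, 2, 1, 3], [1, 3, 5, 8], [1, 4, 0, 4], [1, 5, 2, 7]]
# Only two observations for three columns (not enough observations).
too_few_rows = [[1, 1, 2], [1, 2, 1]]

checks = {
    "independent": has_full_column_rank(independent),
    "collinear": has_full_column_rank(collinear),
    "too_few_rows": has_full_column_rank(too_few_rows),
}
```

Exact fractions are used here so that the rank test does not depend on a floating-point tolerance; a numerical library would normally use a tolerance-based rank instead.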

When doing haplotype regression on a spreadsheet which has some rows inactivated, a regression can also fail when some haplotypes are only present in the inactivated rows.

See 26.18.9, however, for other causes of regression failure that are much more specific to logistic regression.

26.18.1 P-value Plot

In the case that a moving HTR window was specified, a p-value plot is output. The p-value at any specific plot location is from the regression which was done for the window which begins at the indicated marker, and is the full-model-vs-reduced-model p-value, if applicable, or otherwise is the regression p-value.
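The association between plot locations and windows can be sketched as follows. This is an illustration only: the marker names, the function name, and the `regress` callable are hypothetical, and the sketch assumes that only full-length windows are regressed.

```python
def p_value_plot_points(markers, window_size, regress):
    """Return (starting_marker, p_value) pairs, one per moving window.

    `regress` is a hypothetical callable that takes the list of markers
    in a window and returns that window's p-value.
    """
    points = []
    for start in range(len(markers) - window_size + 1):
        window = markers[start:start + window_size]
        # The point is plotted at the marker where the window begins.
        points.append((window[0], regress(window)))
    return points

# Hypothetical marker names; the stand-in regression returns no p-value.
markers = ["rs1", "rs2", "rs3", "rs4", "rs5"]
points = p_value_plot_points(markers, 3, lambda window: None)
plot_locations = [marker for marker, _ in points]
```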

26.18.2 Residual Spreadsheet

In the case that a moving HTR window was NOT specified, a residual spreadsheet may be produced. This spreadsheet will contain the actual, predicted, and residual values for each sample, as well as the estimated haplotype values for haplotypes and the spreadsheet values for non-genetic regressors. The residual value of a sample is defined as the difference between the sample’s actual value and its predicted value from the regression.

NOTE: Strictly speaking, residuals do not make as much sense for logistic regression as they do for linear regression because the distribution of a logistic regression residual separates into two parts. However, HelixTree shows this spreadsheet anyway to allow seeing the individual haplotype frequency estimates, as well as any covariates and interaction terms used. In addition, it provides a crude gauge of how well the regression is predicting the dependent variable.
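The residual definition, and the reason the residuals fall into two groups, can be illustrated with a short sketch. It assumes the usual logistic model, in which the predicted value is a probability between 0 and 1; the function names, the coefficients, and the sample values are hypothetical.

```python
import math

def predicted_probability(intercept, coefficients, regressors):
    """Predicted probability that the dependent variable is 1 under a logistic model."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, regressors))
    return 1.0 / (1.0 + math.exp(-linear))

def residual_row(actual, intercept, coefficients, regressors):
    """Return (actual, predicted, residual), where residual = actual - predicted."""
    predicted = predicted_probability(intercept, coefficients, regressors)
    return actual, predicted, actual - predicted

# Hypothetical fitted values and samples: (actual value, regressor values).
intercept, coefficients = -0.5, [1.2]
samples = [(1, [1.0]), (0, [1.0]), (1, [0.0]), (0, [0.0])]
rows = [residual_row(y, intercept, coefficients, x) for y, x in samples]

# Because the predicted value lies strictly between 0 and 1, samples with an
# actual value of 1 always have positive residuals and samples with an actual
# value of 0 always have negative residuals: the two "parts" of the distribution.
case_residuals = [r for y, _, r in rows if y == 1]
control_residuals = [r for y, _, r in rows if y == 0]
```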

26.18.3 Logistic Regression Statistical Output Viewer

As detailed in 24.2.6, a statistical output viewer will be displayed for a single regression, either when run directly or when invoked from a point in the p-value plot.

26.18.4 Overall Statistics

If a full vs. reduced model is being used, the following overall statistics are displayed for both normal and stepwise regression:

  • The name of the response variable.
  • The full model likelihood. For stepwise regression, “full” means only using the regressors chosen from the stepwise procedure (plus the reduced-model covariates).
  • The reduced model likelihood.
  • The chi-square statistic of the full model.
  • The chi-square statistic of the full model vs. the reduced model.
  • The p-value from the full model.
  • The p-value from the full vs. the reduced model.
  • The permuted P-Value, if permutation testing has been selected.
  • The number of permutations, if permutation testing has been selected.
  • Regression degrees of freedom of the full model.
  • Regression degrees of freedom of the reduced model.
  • Residual degrees of freedom of the full model.
  • Total degrees of freedom of the full model.

If only a full model without a reduced model is being used, the following overall statistics are displayed for both normal and stepwise regression:

  • The name of the response variable.
  • The regression likelihood.
  • The null model likelihood.
  • The sample size.
  • The regression chi-square statistic.
  • The regression p-value.
  • The permuted P-Value, if permutation testing has been selected.
  • The number of permutations, if permutation testing has been selected.
  • The regression degrees of freedom.
  • The residual degrees of freedom.
  • The total degrees of freedom.
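The relationship among the likelihoods, chi-square statistics, degrees of freedom, and p-values listed above can be sketched with the standard likelihood ratio test. This is an assumption about the usual method, not a statement of HelixTree's internal computation: the sketch takes the likelihoods as log-likelihoods, uses twice their difference as the chi-square statistic, and uses the difference in regression degrees of freedom. All function names and numbers are hypothetical.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series of correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * i + 1)
    return total

def likelihood_ratio_test(loglik_full, loglik_reduced, df_full, df_reduced):
    """Return (chi_square, p_value) comparing a full model with a reduced model.

    For a full model with no reduced model, pass the null model's
    log-likelihood and df_reduced = 0.
    """
    chi_square = 2.0 * (loglik_full - loglik_reduced)
    return chi_square, chi_square_sf(chi_square, df_full - df_reduced)

# Hypothetical log-likelihoods and regression degrees of freedom.
chi_square, p_value = likelihood_ratio_test(-50.0, -52.0, df_full=3, df_reduced=1)
```

With these hypothetical inputs the statistic is 4.0 on 2 degrees of freedom, so the p-value is exp(-2), roughly 0.135.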

26.18.5 Regressor Statistics

For all logistic regressions, the y-intercept for the full model is displayed, and for full-vs-reduced-model logistic regressions, the y-intercept for the reduced model is also displayed.

The standard error for the y-intercept (under the full model) is also displayed for all logistic regressions. [Hosmer and Lemeshow 2000]

Then, the following statistics are displayed for each regressor:

  • The regressor, which might be either a haplotype or a covariate.
  • The regression coefficient for this regressor.
  • The standard error (under the full model) for this regressor. [Hosmer and Lemeshow 2000]
  • Pr(> |t|). This is the p-value from regressing using the actual full model as its full model, but using the actual full model without this regressor as its reduced model. Thus, this shows how much difference this particular regressor is making in the regression. Pr(> |t|) refers to the probability that the difference made by adding this regressor is accounted for by chance, and thus that this case could be thought of as being in one of the “tails” of the applicable t-distribution.
  • The odds ratio for this regression coefficient. The regression odds ratio for the coefficient β is e^β. The interpretation of this is how much (by what ratio) the odds of the dependent being one change if the given regressor changes by one unit. One example would be the ratio of the odds of being a case rather than a control for a smoker to the odds of being a case rather than a control for a non-smoker. A second example would be the ratio of the odds of being a case rather than a control for a patient having a given haplotype (homozygously with probability one) to the odds of being a case rather than a control for a patient not having that haplotype at all (probability zero).
  • Univariate Fit. This is the p-value of simply taking a regression with this regressor, all by itself, against the dependent variable. Even if the main regression is full-model vs. reduced-model, this regressor will be the only regression variable involved at all in finding this p-value.
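The odds ratio interpretation in the list above can be verified numerically. The sketch below assumes a single-regressor logistic model with hypothetical intercept and coefficient values, and shows that raising the regressor by one unit multiplies the odds of the dependent variable being one by the exponential of the coefficient.

```python
import math

def logistic(z):
    """Probability that the dependent variable is 1 for linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(probability):
    """Odds corresponding to a probability strictly between 0 and 1."""
    return probability / (1.0 - probability)

def odds_ratio(beta):
    """Odds ratio for a regression coefficient beta."""
    return math.exp(beta)

# Hypothetical fitted model: regressor is 1 for a smoker, 0 for a non-smoker.
intercept, beta = -1.0, 0.7
odds_smoker = odds(logistic(intercept + beta * 1))
odds_non_smoker = odds(logistic(intercept + beta * 0))

# The ratio of the two odds matches the odds ratio of the coefficient.
ratio_from_odds = odds_smoker / odds_non_smoker
ratio_from_beta = odds_ratio(beta)
```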

26.18.6 Left-Out Regressors

Any potential regressors which have been left out are listed here.

All non-stepwise regressions which include haplotypes leave out one final haplotype-based regressor. This may be a haplotype of a “normal” frequency (if there is no “rare” haplotype), a “rare” haplotype, or one final haplotype category consisting of the frequencies of the “rare haplotypes” aggregated together. Leaving out this final regressor avoids the multicollinearity problem that would otherwise occur between the haplotypic regressors.
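The multicollinearity that this avoids can be seen in a small sketch. It assumes that each sample's estimated haplotype values sum to the same constant (2 in the hypothetical data below), so the final haplotype column is fully determined by that constant and the other columns. The function names and the data are hypothetical.

```python
def rows_sum_to_constant(haplotype_rows, tol=1e-9):
    """True if every sample's haplotype values add up to the same total."""
    totals = [sum(row) for row in haplotype_rows]
    return all(abs(t - totals[0]) <= tol for t in totals)

def drop_last_haplotype(haplotype_rows):
    """Leave out the final haplotype-based regressor."""
    return [row[:-1] for row in haplotype_rows]

# Hypothetical estimated haplotype values: one row per sample, one column
# per haplotype category (the last column could be the aggregated "rare" category).
haplotype_rows = [
    [1.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.5, 0.5],
]

redundant = rows_sum_to_constant(haplotype_rows)
kept = drop_last_haplotype(haplotype_rows)

# The left-out column carries no new information: it can be rebuilt from
# the constant total and the columns that were kept.
total = sum(haplotype_rows[0])
rebuilt_last = [total - sum(row) for row in kept]
```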

For a stepwise regression, the list of left-out regressors will include all regressors that were excluded from the final model of the regression.

26.18.7 Table of Haplotypes

If haplotypes were involved, a table of the haplotypes (used in the regression or not) and their frequencies is shown.

Additionally, there are two circumstances where additional information is shown for each haplotype:

  • If a full vs. reduced model is used for the main regression, the “Individual vs. Reduced Model P-Value” is shown for each haplotype. The full model for this p-value is derived from taking the reduced model and adding to it the haplotype being listed. The reduced model is also used as a reduced model here. (This contrasts with the method for finding the “univariate fit”, which uses no reduced model.)
  • If a full-model-only (logistic) regression is performed, imputed haplotype counts for cases and imputed haplotype counts for controls are shown, along with the total case count and the total control count. The imputation is based on Hardy-Weinberg equilibrium throughout the data, including both cases and controls.

26.18.8 Parameters

The parameters used for the regression are shown.

NOTE: If this display was made from clicking a single point in a p-value plot made from a moving window, and then clicking the View Regression Results button, the markers used for this point’s regression are shown just before this (Parameters) section, after the table of haplotypes. Otherwise, they are shown near the bottom of this section.

26.18.9 Caveats

Under some circumstances, the iteration procedure for the logistic regression will be unstable and the regression may fail, even when the matrix has sufficient rank and significant regressors are included. Such a circumstance can be when the regression tries to emulate a step function, or otherwise tries to accommodate independent values for which the dependent value is either exclusively 1 or exclusively 0.
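A simple screen for the step-function situation, for one regressor at a time, is sketched below. This is a heuristic illustration rather than anything HelixTree does: it only checks whether a single threshold on one regressor splits the 0 and 1 responses perfectly. The function name and the data are hypothetical.

```python
def is_completely_separated(x_values, y_values):
    """True if one threshold on x splits the 0 and 1 responses perfectly.

    In that situation the fitted curve tries to become a step function
    and the coefficient grows without bound during iteration.
    """
    xs_for_zero = [x for x, y in zip(x_values, y_values) if y == 0]
    xs_for_one = [x for x, y in zip(x_values, y_values) if y == 1]
    if not xs_for_zero or not xs_for_one:
        # Only one response value is present, so there is nothing to fit.
        return True
    return max(xs_for_zero) < min(xs_for_one) or max(xs_for_one) < min(xs_for_zero)

# Hypothetical data: the first response pattern is a perfect step, the second is not.
step_like = is_completely_separated([1, 2, 3, 4], [0, 0, 1, 1])
overlapping = is_completely_separated([1, 2, 3, 4], [0, 1, 0, 1])
```

A check of this kind does not catch separation that only appears through a combination of several regressors.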

If the regression is being done stepwise, similar circumstances resulting in instability may cause “paradoxical” phenomena such as:

  • The final regression (used to get the statistics to show) failing, even though it “is the same as” the last model tried in the stepwise regression. (Actually, the regressors in the final model can be in a different order than in the last model tried in the stepwise regression. If the problem is highly unstable, the different order may be enough to cause the failure.)
  • For some regressors, you may have Pr(> |t|) = 1. This happens when the regression fails after removing the current regressor. (Of course, this is only possible for a regressor other than the latest one that was added in.)

Techniques for remedying this situation directly are being contemplated, and may be implemented in HelixTree in the future.

At this time, the best workaround is to filter out the data that causes such instabilities. Warning signs include a covariate or haplotype whose coefficient is above 15 or 20 or below -15 or -20 while the regressors from a stepwise regression won’t regress directly, or a covariate or haplotype that doesn’t regress by itself. In either case, consider splitting (doing recursive partitioning) on the covariate or on the markers involved in the haplotype, and doing the regression on one or more of the tree node subset spreadsheets.