Performing Analysis

To perform regression analysis, open a spreadsheet and click the Analysis -> Perform Regression menu item. This feature is currently supported only for spreadsheets with exactly one column set as dependent. Categorical dependent columns are currently not supported.

Activating the Perform Regression menu item will open a regression window where various regression options can be changed. A list and brief description of these options is as follows:


[Picture]
Figure 24.2: Selecting Perform Regression from a spreadsheet menu

24.2.1 Haplotype Trend Regression (HTR)

If this group is checked, the regression will include haplotype data derived from the pertinent genetic markers. This option will be turned off automatically for datasets that do not contain genetic columns. Parameters relating specifically to HTR are the following:

  • Haplotype frequency estimation method: This allows you to select how haplotype frequencies are estimated when performing the haplotype trend regression. Either the Composite Haplotype Method (CHM) or the Expectation/Maximization (EM) algorithm may be specified. EM is used as the default method. EM will usually narrow down the number of haplotypes expected to occur at a significant frequency more than the CHM does, and thus simplify the regression calculations and results. The CHM is a quick enumeration procedure for haplotype probabilities that does not assume Hardy-Weinberg Equilibrium (HWE) over the markers, whereas EM assumes HWE, not only in the population as a whole, but also in cases and controls looked at separately. If the HWE assumption fails, CHM-based haplotype trend regression may not only outperform EM-based haplotype trend regression but may also have greater power. (See 26.7.)

    If you have selected EM, then you can also specify the Maximum number of EM iterations, and the EM convergence tolerance.

  • Minimum haplotype frequency: This allows you to exclude haplotypes whose frequency is low. This is desirable when you have multiallelic markers, or when you are using a wide window size and many haplotypes are not represented in the sample, which creates the possibility of a rank-deficient regression matrix. All haplotypes below the specified threshold will be binned together into a combined group. The default threshold is 0.01. Raising the threshold may avoid the creation of a rank-deficient regression matrix. The downside is that low-probability haplotypes that may be highly predictive of the response get their probability lumped into the rare haplotypes column, and their signal may be missed.
  • Impute missing values for haplotypes: The default setting (checked) imputes the haplotype probabilities even for the patients with missing values for genetic markers (whether by EM or CHM).

    When this is unchecked, individuals with missing values for their genotypes are excluded from the regression. You may wish to uncheck this box when values are not missing at random, for instance when there are more missing values for controls than for cases.

  • Haplotype grouping: When performing HTR, you can choose whether to test haplotypes in a moving window or to test haplotypes in a specific set of markers. If you select one of the first two of the following options, a regression will be performed for each set of markers in the moving window. The results of these multiple regressions will be displayed in a P-value plot from which individual regression results can be selected and displayed.

       Moving window with fixed size specifies that a fixed number of markers should be used for the moving window. Selecting this option allows you to choose the size of the moving window and whether or not to ignore marker mapping. “Ignore marker mapping” is only applicable if a marker map has been applied to the original spreadsheet. If this option is checked, markers are used in the order in which they appear in the spreadsheet. If it is unchecked, markers are used in the order in which they appear in the marker map, and the moving window will not cross chromosome boundaries as defined in the marker map.

       Moving window using genetic distance and window size: This option is only available for spreadsheets where a marker map has been applied. Selecting this option specifies that both the genetic distance and the size of the moving window will define which markers are considered to be within the window. The “units” field defines the maximum genetic distance (in the units of the marker map) that the moving window will include. “Maximum window size” specifies the maximum number of markers within the maximum genetic distance that will be included in the window. The window will not cross chromosome boundaries as defined in the marker map.

       Run once for selected markers uses the markers which have been highlighted in the adjacent list of markers in the analysis. These markers are listed in spreadsheet order, and if a marker map has been applied, the chromosome and gene information will also be displayed.
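
The binning performed by the Minimum haplotype frequency option above can be pictured with the following minimal sketch. It is not HelixTree's implementation; the function name, the dictionary layout, and the "rare" column label are assumptions made for illustration only.

```python
def bin_rare_haplotypes(hap_probs, frequencies, min_freq=0.01, rare_label="rare"):
    """Combine haplotypes whose estimated frequency is below min_freq.

    hap_probs:   one dict per sample, mapping haplotype -> probability
                 (hypothetical layout)
    frequencies: dict mapping haplotype -> estimated frequency
    """
    rare = {h for h, f in frequencies.items() if f < min_freq}
    binned = []
    for row in hap_probs:
        # Keep the common haplotypes as separate regressor columns.
        new_row = {h: p for h, p in row.items() if h not in rare}
        if rare:
            # Lump the probability of all rare haplotypes into one column.
            new_row[rare_label] = sum(p for h, p in row.items() if h in rare)
        binned.append(new_row)
    return binned


# Hypothetical frequencies and per-sample haplotype probabilities.
frequencies = {"AB": 0.6, "Ab": 0.395, "aB": 0.004, "ab": 0.001}
samples = [
    {"AB": 1.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0},
    {"AB": 0.0, "Ab": 0.5, "aB": 0.25, "ab": 0.25},
]
binned = bin_rare_haplotypes(samples, frequencies)
```

With the default threshold of 0.01, the two rare haplotypes share a single column, which removes two sparse regressors at the cost of hiding any signal specific to either one.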
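
As an illustration of the fixed-size moving window described above, the following sketch enumerates the marker sets that would each receive their own regression. The function name and the (name, chromosome, position) tuple layout are hypothetical, and the sketch assumes that a window is only formed when a full set of markers is available on one chromosome.

```python
def fixed_size_windows(markers, window_size, ignore_marker_map=False):
    """Yield one list of marker names per window position.

    markers: list of (name, chromosome, position) tuples in spreadsheet
             order (hypothetical layout).
    """
    if ignore_marker_map:
        # Use markers exactly as they appear in the spreadsheet.
        names = [name for name, _, _ in markers]
        for start in range(len(names) - window_size + 1):
            yield names[start:start + window_size]
        return

    # Otherwise order markers by map position within each chromosome,
    # so that no window crosses a chromosome boundary.
    by_chromosome = {}
    for name, chromosome, position in markers:
        by_chromosome.setdefault(chromosome, []).append((position, name))
    for chromosome_markers in by_chromosome.values():
        ordered = [name for _, name in sorted(chromosome_markers)]
        for start in range(len(ordered) - window_size + 1):
            yield ordered[start:start + window_size]


# Hypothetical marker map: three markers on chromosome 1, two on chromosome 2.
markers = [("m1", "1", 100), ("m3", "1", 300), ("m2", "1", 200),
           ("m4", "2", 50), ("m5", "2", 75)]
map_windows = list(fixed_size_windows(markers, 2))
sheet_windows = list(fixed_size_windows(markers, 2, ignore_marker_map=True))
```

With the marker map respected, no window pairs m3 with m4, because they lie on different chromosomes. With the map ignored, the windows simply follow spreadsheet order.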

24.2.2 Use Non-Genetic Covariates

Check this box if you would like to include non-haplotype covariates or first-order interactions between non-genetic covariates in the regression. (See 26.10.)

To include a covariate in the analysis, first click the Add Covariate button. This will open a dialog allowing you to select the covariate(s) that you would like to use during analysis. Then select the covariate(s) that you would like to include and click Add. If you would like to add all non-genetic columns as covariates, click Add All. The selected covariates will be shown in the “included covariates” list. To remove a covariate, select the covariate(s) that you would like to remove and click Remove Covariate. This will remove the selected item(s) from the “included covariates” list and thereby exclude them from the analysis. To remove all covariates, click Clear Covariates.


[Picture]
Figure 24.3: Selecting covariates

To include first-order interactions, click the Add Interaction button located in the lower right corner of the group. This will open a dialog which displays two lists, each containing all of the non-genetic column names within the data. Select the term(s) from each of the two lists which you would like to include, and click Add. All selected items from the list on the left will be paired with all selected items from the list on the right, and an item for each pair will be added to the list in the lower right of the regression window. If any of the selected items in either window represent categorical columns, then sub-items representing the dummy variables used in regression for each category will be paired with the items or sub-items from the other window. (Values from each pair are multiplied to create a “new” covariate, which is then used in the regression.)
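
The pairing and multiplication described above can be sketched as follows. This is an illustration only: the function name, the column names, and the "a*b" naming of the new covariates are assumptions, and categorical columns are assumed to have been expanded into numeric dummy variables beforehand.

```python
from itertools import product


def interaction_terms(columns, left, right):
    """Pair every selected left item with every selected right item.

    columns: dict mapping column name -> list of numeric values, one per
             sample (dummy variables already expanded; hypothetical layout).
    Returns a dict mapping "a*b" -> elementwise product of the two columns.
    """
    terms = {}
    for a, b in product(left, right):
        terms[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return terms


# Hypothetical data: two numeric covariates and one dummy variable.
columns = {
    "age": [30, 40, 50],
    "dose": [1.0, 2.0, 0.5],
    "sex=M": [1, 0, 1],
}
terms = interaction_terms(columns, left=["age", "dose"], right=["sex=M"])
```

Selecting two items on the left and one on the right yields two new covariates, one per pair.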

When you have added all of the interactions that you are interested in, click Close to return to the regression window. All listed interactions will be included in the analysis, so unwanted interactions must be removed in order to exclude them. To remove an interaction, select the item(s) that you want to remove and click Remove Interaction. To remove all interactions, click Clear Interactions.


[Picture]
Figure 24.4: Selecting first-order interactions


[Picture]
Figure 24.5: Regression window with included covariates and interactions

24.2.3 Type of Regression

You may choose to perform:

  • Linear or Logistic Regression: checking this option specifies that a single linear or logistic regression will take place. For a binary dependent data column, the regression will be logistic; for an integer or real dependent data column, the regression will be linear.
  • Stepwise Regression: checking this option specifies that the linear or logistic regression should be done as a stepwise regression procedure (see 26.9). If you are running stepwise regression, you will also be able to specify a P-value cutoff.
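
The rule for choosing between linear and logistic regression can be summarized in a few lines. The function and the type names below are hypothetical labels for the column types named in the text, not identifiers from the software.

```python
def choose_regression(dependent_type):
    """Return the kind of regression implied by the dependent column type."""
    if dependent_type == "binary":
        return "logistic"
    if dependent_type in ("integer", "real"):
        return "linear"
    # Categorical dependent columns are not supported.
    raise ValueError(f"unsupported dependent column type: {dependent_type}")


kinds = [choose_regression(t) for t in ("binary", "integer", "real")]
```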

24.2.4 Permutation Tests

You also have the option of performing permutation tests. If the regression is not over a moving window, the dependent variable is randomly permuted a designated number of times, and the selected regression is performed for each permutation. The regression results will show the permuted P-value, which is defined as the percentile of the permutations in which the permuted-test P-value is less than or equal to the original-test P-value. When performing a regression over a moving window, a minimum P-value across the moving window of tests is calculated for each random permutation of the dependent variable. The reported permuted P-value for each set of markers in the moving window is defined and calculated as the percentile of these minimum P-values that are less than or equal to the given observed P-value.

NOTE: As stated above, the permutation testing of HelixTree’s linear and logistic regression modules permutes the dependent variable, then runs the regressions all over again, checking the significance of these regressions. The original regression matrices are not used. This is distinct from checking the “fit” of the permuted dependent to the original regression results from a given set of regressors. The object is to see whether by chance, a different set of dependents could have had a better relationship or “fit” with the covariates and haplotypes. This is tested through performing a new regression for each permutation.
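
The two definitions of the permuted P-value can be written out as a short sketch. It is not the module's code: the function names are hypothetical, the result is expressed as a fraction between 0 and 1 (an assumption about scaling), and the regression that would be rerun on each permuted dependent is left to the caller.

```python
import random


def permuted_dependents(dependent, n_permutations, seed=None):
    """Yield n_permutations random shuffles of the dependent column."""
    rng = random.Random(seed)
    values = list(dependent)
    for _ in range(n_permutations):
        rng.shuffle(values)
        yield list(values)  # a new regression would be run on each of these


def permuted_p_value(observed_p, permuted_ps):
    """Fraction of permutations whose P-value is <= the observed P-value."""
    return sum(p <= observed_p for p in permuted_ps) / len(permuted_ps)


def moving_window_permuted_p_values(observed_ps, permuted_ps):
    """permuted_ps[k][w] is the P-value of window w under permutation k."""
    # One minimum P-value across all windows for each permutation.
    minima = [min(window_ps) for window_ps in permuted_ps]
    return [permuted_p_value(p, minima) for p in observed_ps]


# Hypothetical P-values: 4 permutations, and 2 window positions.
single = permuted_p_value(0.03, [0.5, 0.02, 0.2, 0.03])
windows = moving_window_permuted_p_values(
    [0.01, 0.2],
    [[0.3, 0.05], [0.02, 0.6], [0.5, 0.4], [0.005, 0.9]],
)
```

In the moving-window case every observed P-value is compared against the same set of per-permutation minima, which is why the permuted P-value of a window depends on the other windows in the run.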

(See 26.21 for a more detailed explanation and examples of permutation testing.)

24.2.5 Create Residual Spreadsheet With Covariates

If this option is checked, a residual spreadsheet will be created along with the results view from the regression. This spreadsheet will contain the actual, predicted, and residual values for each sample, as well as the estimated frequencies for haplotypes and the spreadsheet values for the non-genetic regressors. This option is only available if you are not performing HTR, or if you are performing HTR for a specific set of markers (not a moving window).
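
As a rough picture of what one row of such a spreadsheet holds, the sketch below assembles the columns named above. The column names are hypothetical, and the residual is assumed to follow the usual convention of actual minus predicted, which the text does not state explicitly.

```python
def residual_rows(sample_ids, actual, predicted, regressors):
    """Build one row per sample: actual, predicted, residual, regressors.

    regressors: dict mapping regressor name -> list of values per sample
                (haplotype frequencies and non-genetic covariates).
    """
    rows = []
    for i, sample_id in enumerate(sample_ids):
        row = {
            "sample": sample_id,
            "actual": actual[i],
            "predicted": predicted[i],
            "residual": actual[i] - predicted[i],  # assumed convention
        }
        for name, values in regressors.items():
            row[name] = values[i]
        rows.append(row)
    return rows


# Hypothetical values for two samples and one covariate.
rows = residual_rows(["s1", "s2"], [3.0, 1.0], [2.5, 1.5], {"age": [30, 40]})
```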


[Picture]
Figure 24.6: Example residual spreadsheet

24.2.6 Output and Running the Regression

Click Regression to start the regression procedure (the regression window will remain open, allowing for further regressions until the “close” button is clicked). Note that sometimes the regression may fail. (See 26.11 and 26.18.)

The form of the results produced by the regression will vary according to the parameters used. If you are running a single regression (i.e., not HTR with a moving window), the results will be displayed in a text viewer that shows the coefficients and other statistics for each regressor, as well as various statistics for the regression as a whole. If the options were set to perform HTR with a moving window, a P-value plot showing the P-values for the various window positions will be displayed. To view the results for a single window position, click the marker that you are interested in, and then click View Regression Results. This will open a results viewer for the regression done at the moving window position whose first marker is the marker you clicked. For an explanation of the statistical results shown in the viewer, please see 26.11.3 (for linear regressions) or 26.18.3 (for logistic regressions).


[Picture]
Figure 24.7: Example single regression results window


[Picture]
Figure 24.8: Moving window regression results example using the data_312 data set