Haplotype Association Tests
8.2 Haplotype Association Tests
Haplotype Association Tests Overview
The Haplotype Association Tests window offers a straightforward way of testing for association between the haplotype frequencies of marker blocks and a case/control status. Tests can be run on the individual haplotypes inferred for a marker block, or on all significant haplotypes together on a per-block basis.
Ways of Defining Haplotype Blocks
SVS has a few convenient ways of defining the marker blocks to be used in the association test.
- Use precomputed blocks
- Use all markers as a single block
- Use a moving window of markers
These are described in more detail below.
Use Precomputed Blocks
By selecting this option, you have complete control over the definition of the marker blocks to be used in analysis. This option reads block definitions from an external spreadsheet and a given block definition column.
The spreadsheet with the block columns should have the same marker names as its row labels that the current spreadsheet has as column headers. A block definition column should be a column of type Integer. Each row of the column specifies the number of the block that the row's marker is a member of. A missing value indicates that the marker in that row is not in any block.
When you have selected a block spreadsheet, you must then choose which of the valid block definition columns from that spreadsheet you would like to use for analysis.
In a common workflow, you may wish to run the Haplotype Block Detection algorithm to produce a block spreadsheet with blocks defined algorithmically. Then open this dialog to select the resulting block spreadsheet to define the blocks to be used for Haplotype Analysis.
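As an illustration of how a block definition column maps markers to blocks, here is a minimal Python sketch. It is not SVS code: the function name `group_markers_by_block`, the marker names, and the use of `None` to stand for a missing value are all assumptions made for the example.

```python
from collections import defaultdict

def group_markers_by_block(marker_names, block_column):
    """Map each block number to the ordered list of markers assigned to it.

    `block_column` holds one entry per marker: an integer block number,
    or None for a missing value (the marker is not in any block).
    """
    if len(marker_names) != len(block_column):
        raise ValueError("one block entry is required per marker")
    blocks = defaultdict(list)
    for name, block in zip(marker_names, block_column):
        if block is None:
            continue  # missing value: marker belongs to no block
        blocks[int(block)].append(name)
    return dict(blocks)

# Hypothetical marker names and block numbers.
markers = ["rs1", "rs2", "rs3", "rs4", "rs5"]
block_col = [1, 1, None, 2, 2]
print(group_markers_by_block(markers, block_col))
# {1: ['rs1', 'rs2'], 2: ['rs4', 'rs5']}
```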
Use All Markers as Single Block
By selecting this option, the entire set of active markers for the current spreadsheet will be treated as a single block. This may be useful when investigating entire sets of markers produced as subset spreadsheets from an LD plot. See Using LD Plots for more information.
Use a Moving Window of Markers
By selecting this option, a set of blocks will be automatically generated based on parameters for a moving window. There are two options for the moving window – either a moving window of a fixed number of columns, or, if a marker map is applied, a dynamic moving window size based on the base pair distance between markers.
- Fixed window size: Specifies that a fixed number of markers should be used for the moving window.
- Dynamic window size in base pairs: Specifies both the base pair distance, in kilobase pairs, and the maximum size of the moving window. Together these define which markers are considered to be within the window. The “kb” field defines the maximum distance in kilobase pairs that the moving window will span, and the “max columns” field, if used, specifies the maximum number of columns within that distance to be included in the window. The window will not cross over chromosome boundaries as defined in the marker map. This option is only available for spreadsheets where a marker map has been applied.
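The two window modes can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the window is assumed to advance one marker at a time, windows are allowed to shrink to a single marker at the end of a chromosome, and the marker map is represented as a sorted list of `(chromosome, position_bp)` tuples. SVS may differ on any of these points.

```python
def fixed_windows(n_markers, size):
    """Windows of `size` consecutive marker indices, one start per marker (assumed step of 1)."""
    if size < 1:
        raise ValueError("window size must be at least 1")
    return [list(range(start, start + size))
            for start in range(0, n_markers - size + 1)]

def dynamic_windows(marker_map, kb, max_columns=None):
    """Windows bounded by a base pair distance and an optional column cap.

    `marker_map` is a list of (chromosome, position_bp) tuples, sorted by
    chromosome and then by position. A window never crosses a chromosome.
    """
    max_bp = kb * 1000
    windows = []
    for start, (chrom, pos) in enumerate(marker_map):
        window = []
        for j in range(start, len(marker_map)):
            other_chrom, other_pos = marker_map[j]
            if other_chrom != chrom or other_pos - pos > max_bp:
                break  # chromosome boundary or distance limit reached
            if max_columns is not None and len(window) >= max_columns:
                break  # column cap reached
            window.append(j)
        windows.append(window)
    return windows

# Hypothetical marker map: three markers on chromosome 1, two on chromosome 2.
marker_map = [("1", 1000), ("1", 4000), ("1", 9000), ("2", 500), ("2", 2500)]
print(fixed_windows(len(marker_map), 3))
print(dynamic_windows(marker_map, kb=5))
print(dynamic_windows(marker_map, kb=5, max_columns=1))
```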
Association Tests Used with Haplotype Frequencies
The following statistical tests are available for assessing the significance of the association between the selected case/control dependent variable and the haplotype frequencies:
- Chi-squared test
- Odds ratio with 95% CI
- Logistic regression
These statistics are applied in different ways based on whether the tests are done on a per-block or per-haplotype basis.
Tests Computed Per Haplotype vs. Per Block
There are two modes of computing haplotype association tests:
- Calculate per haplotype: Specifies that for each marker block, the haplotype frequencies will be computed (see How Haplotype Frequencies are Computed) and for each haplotype above the frequency threshold, the selected tests will be computed to measure the association between each haplotype and the case/control dependent trait.
- Calculate per block: Specifies that, for each marker block, all the haplotypes above the frequency threshold are tested together in the association test. This measures the association of the haplotype block as a whole with the case/control trait.
Chi-Squared Test of Haplotype Association
When selected, the chi-squared test sets up a 2 × N contingency table comparing the haplotype frequencies for the cases vs. the controls. The values in the contingency table are based on the haplotype counts in cases and in controls.
On a per-haplotype basis, the haplotype counts are computed by a summation of the values in a full frequency table (see Haplotype Tables). For each haplotype, the haplotype’s frequencies for cases and for controls are individually summed, multiplied by 2, and placed in the first column of the contingency table. The frequencies of all the other haplotypes are then summed into a single case count and a single control count, multiplied by 2, and placed in the second column of the contingency table. Given that h is the current haplotype and n is the summation over all haplotypes other than h, where there are N haplotypes in total, the table is constructed as follows.
| | Current Haplotype | Other Haplotypes |
| Case | $h_{\text{case}}$ | $n_{\text{case}}$ |
| Control | $h_{\text{control}}$ | $n_{\text{control}}$ |
Where $h_{\text{trait}} = \sum_{i \in \text{trait}} h_i \cdot 2$ for each case and control trait and each sample $i$ with that trait for the current haplotype.
And $n_{\text{trait}} = \sum_{x=1}^{N} \sum_{i \in \text{trait}} I_{x \neq h}(x_i) \cdot 2$ for each case and control trait and each sample $i$ with that trait, where $I_{x \neq h}$ is an indicator function that returns 0 when $x = h$.
On a per-block basis, the contingency table is constructed with N columns, one for each haplotype. Each column will be computed for the given haplotype h according to the above $h_{\text{trait}}$ formula.
See Statistics Available for Genotype Association Tests for details on how a chi-square statistic and p-value are computed from the contingency table.
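The table construction described above can be sketched in Python as follows. The function names and the input layout (one row of estimated haplotype frequencies per sample, plus a case flag per sample) are assumptions for the example, and the statistic shown is a plain Pearson chi-squared without continuity correction, which may not match every detail of what SVS computes. The p-value helper covers only the 1 degree of freedom case of the 2 × 2 per-haplotype table.

```python
import math

def haplotype_counts(freq_rows, is_case, h):
    """Return (case count, control count) for haplotype index h: summed frequency times 2."""
    case = 2 * sum(row[h] for row, c in zip(freq_rows, is_case) if c)
    control = 2 * sum(row[h] for row, c in zip(freq_rows, is_case) if not c)
    return case, control

def per_haplotype_table(freq_rows, is_case, h):
    """2 x 2 table: current haplotype h vs. all other haplotypes."""
    n_haps = len(freq_rows[0])
    h_case, h_control = haplotype_counts(freq_rows, is_case, h)
    others = [haplotype_counts(freq_rows, is_case, x) for x in range(n_haps) if x != h]
    n_case = sum(c for c, _ in others)
    n_control = sum(k for _, k in others)
    return [[h_case, n_case], [h_control, n_control]]

def per_block_table(freq_rows, is_case):
    """2 x N table: one column per haplotype."""
    cols = [haplotype_counts(freq_rows, is_case, x) for x in range(len(freq_rows[0]))]
    return [[c for c, _ in cols], [k for _, k in cols]]

def pearson_chi2(table):
    """Pearson chi-squared statistic for a contingency table (no continuity correction)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Made-up frequencies for 4 samples and 3 haplotypes; each row sums to 1.
freq_rows = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
is_case = [True, True, False, False]
table = per_haplotype_table(freq_rows, is_case, h=0)
print(table, pearson_chi2(table), chi2_pvalue_1df(pearson_chi2(table)))
print(per_block_table(freq_rows, is_case))
```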
Odds Ratio Test of Haplotype Association
When selecting the Odds Ratio Test, you will get odds ratios and the lower and upper 95% confidence bounds of the current haplotype versus the other haplotypes. The values used in the odds ratio computation are the same counts described in the above 2 × 2 contingency table.
Odds ratio tests are only available on a per-haplotype basis.
NOTE: An odds ratio is generally considered significant if both the lower and the upper 95% confidence bounds are greater than one (or both less than one for an odds ratio less than one).
See the Formulas and Theories chapter for an explanation of this statistic (Section Statistics Available for Genotype Association Tests).
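For orientation, the sketch below computes an odds ratio and a 95% confidence interval from the four cells of the 2 × 2 table. It uses the common log odds ratio (Woolf) interval, which is an assumption here; the Formulas and Theories chapter remains the reference for what SVS actually computes. The function name and the example counts are hypothetical.

```python
import math

def odds_ratio_ci(h_case, n_case, h_control, n_control, z=1.959964):
    """Odds ratio of the current haplotype vs. the others, with a Woolf-type interval.

    All four counts must be positive. `z` is the normal quantile for a 95% interval.
    """
    odds_ratio = (h_case * n_control) / (n_case * h_control)
    se_log = math.sqrt(1.0 / h_case + 1.0 / n_case + 1.0 / h_control + 1.0 / n_control)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts: the interval spans 1, so by the rule in the NOTE
# above this odds ratio would not be considered significant.
odds_ratio, lower, upper = odds_ratio_ci(3.0, 1.0, 1.0, 3.0)
print(odds_ratio, lower, upper)
```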
Logistic Regression Test of Haplotype Association
If the Logistic Regression test is selected, a regression is performed with the case/control status as the response. The binary response, $y$, is fit to the given predictor variables $x_i$ using logistic regression. The results include the regression p-value and the reportable intercepts.
In a per-haplotype test, there is a single predictor variable, constructed from the haplotype frequencies you would find in the haplotype frequency table (see Haplotype Tables) for the given haplotype. In addition to the p-value for the regression, the β0 and β1 terms and their respective standard errors are reported in their own columns.
When computed per block, a single logistic regression is done with N = num_haplotypes predictor variables. The intercept for the solved regression is reported along with the p-value.
See Logistic Regression for more details on how the logistic regression is performed and how the resulting p-values are computed.
How Haplotype Frequencies are Computed
Because the phase of the genotypic information in SNPs is not known, haplotype frequencies must be estimated using statistical methods. Although the estimation algorithms may find many potential haplotypes, there are usually only a handful with significant frequencies in a given dataset.
The Frequency threshold parameter restricts the association tests to haplotypes with an estimated frequency above the threshold, which reduces the number of variables considered. Both estimation methods allow samples with missing genotypes to have their haplotype frequencies inferred. Select Impute missings to enable this feature.
Currently, there are two methods for estimating haplotype frequencies (see the link below for details about the algorithms and their individual strengths). If you select the EM method, you must also provide the additional Maximum EM iterations and EM convergence tolerance parameters used by the algorithm.
See Haplotype Frequency Estimation Methods for more information on the details of each estimation algorithm.
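To show where the Maximum EM iterations and EM convergence tolerance parameters come into play, here is a toy EM estimator for the simplest case of two biallelic markers. It is a sketch, not the SVS algorithm: the genotype coding (0, 1, or 2 copies of the alternate allele), the uniform starting frequencies, the convergence criterion (largest change in any frequency), and the function name are all assumptions, and missing genotypes are not handled.

```python
def em_two_snp_haplotypes(genotypes, max_iter=50, tol=1e-6):
    """Estimate frequencies of haplotypes "00", "01", "10", "11" from unphased genotypes.

    `genotypes` is a list of (g1, g2) pairs; each g is the count (0, 1, 2)
    of the alternate allele at that marker.
    """
    haps = ("00", "01", "10", "11")
    freqs = {h: 0.25 for h in haps}
    n_chrom = 2 * len(genotypes)
    for _ in range(max_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10.
                cis = freqs["00"] * freqs["11"]
                trans = freqs["01"] * freqs["10"]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts["00"] += w
                counts["11"] += w
                counts["01"] += 1.0 - w
                counts["10"] += 1.0 - w
            else:
                # Phase is unambiguous: read the two haplotypes off directly.
                a = (g1 // 2, (g1 + 1) // 2)
                b = (g2 // 2, (g2 + 1) // 2)
                counts[f"{a[0]}{b[0]}"] += 1.0
                counts[f"{a[1]}{b[1]}"] += 1.0
        # M-step: new frequencies are the expected counts per chromosome.
        new_freqs = {h: counts[h] / n_chrom for h in haps}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haps)
        freqs = new_freqs
        if change < tol:
            break
    return freqs

# Made-up genotypes for ten samples.
genotypes = [(0, 0)] * 4 + [(2, 2)] * 3 + [(1, 1)] * 2 + [(1, 0)]
print(em_two_snp_haplotypes(genotypes, max_iter=100, tol=1e-8))
```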
Multiple Testing Correction
To account for the multiple testing problem, you can have additional output columns computed for each selected association test that perform a form of multiple testing correction.
Bonferroni and FDR multiple testing corrections, as well as single value and full scan permutation tests, can be applied to the chi-squared and regression p-value results.
See the Formulas and Theories chapter for an explanation of these correction methods in the False Discovery Rate and Permutation Testing Methodology sections.
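As a rough illustration of the two closed-form corrections, the sketch below adjusts a list of p-values. The Bonferroni adjustment is standard. For FDR, the Benjamini-Hochberg step-up procedure is assumed, since this section does not name the FDR variant SVS uses. Function names and p-values are made up, and the permutation tests are not shown.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def fdr_benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvalues = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(pvalues))
print(fdr_benjamini_hochberg(pvalues))
```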
Additional Outputs
To get the most out of your association test results, some convenient derivative statistics can be computed from your p-values.
- Output -log10(P): Computes the value −log10(p-value) for each p-value and multiple testing corrected p-value.
- Output data for P-P/Q-Q plots: Computes expected values for each p-value and each multiple testing corrected p-value. By plotting the expected vs. actual p-values, you can create P-P or Q-Q plots. This option forces the -log10(P) output as well.
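The two derived outputs can be sketched as follows. The expected value used here, rank / (n + 1) under a uniform null distribution, is one common convention and is an assumption; SVS may use a different plotting position. The function names are hypothetical, and p-values of exactly 0 are not handled.

```python
import math

def neg_log10(pvalues):
    """-log10 of each p-value (p-values must be greater than 0)."""
    return [-math.log10(p) for p in pvalues]

def expected_pvalues(pvalues):
    """Expected p-value for each observed one under a uniform null: rank / (n + 1)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    expected = [0.0] * n
    for rank, i in enumerate(order, start=1):
        expected[i] = rank / (n + 1)
    return expected

observed = [0.2, 0.01, 0.5]
expected = expected_pvalues(observed)
# Points for a Q-Q plot on the -log10 scale: (expected, observed) pairs.
qq_points = list(zip(neg_log10(expected), neg_log10(observed)))
print(expected)
print(qq_points)
```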
Also, if you select Haplotype frequencies when doing a per-haplotype test, each row of the results spreadsheet will list the overall, case, and control frequencies of its haplotype in columns beside the description of the haplotype.
Haplotype Association Tests Results
The results spreadsheet for the haplotype association tests starts with a column containing block numbers. If you selected precomputed blocks, these are the block numbers of the blocks that were used in the association tests. If you generated blocks dynamically, these numbers identify the generated blocks. Because of this column, you can select this spreadsheet from an LD plot (see Haplotype Block Sets and LD Graphs) as a marker block spreadsheet to visualize the blocks on an LD graph.
When performing per-haplotype tests, each row represents the tests for a given haplotype. The second column will contain a string representation of the haplotype that was tested.