Welcome to the LD and Haplotype Analysis Tutorial!
Updated: January 28, 2021
Packages: SNP Analysis, Power Seat
This tutorial leads you through various linkage disequilibrium (LD) and haplotype analyses in SVS. For demonstration purposes, it uses a dataset consisting of actual Affymetrix 500K genotypes from the four HapMap populations (Phase II), mapped to the GRCh37 (hg19) human reference build, together with a simulated case/control status and a simulated quantitative phenotype.
This tutorial does not cover quality assurance, and no quality assurance steps have been performed on the tutorial data. With your own data, it may be appropriate to filter out markers that deviate from Hardy-Weinberg equilibrium or that have low call rates or low minor allele frequencies before performing LD and haplotype analysis.
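As a rough illustration of the marker-level quality metrics mentioned above, the following Python sketch computes call rate, minor allele frequency, and a Hardy-Weinberg chi-squared p-value for one biallelic marker. It is not SVS code: the function name marker_qc and the genotype coding ("AA", "AB", "BB", None for missing) are assumptions made for this example, and SVS may use a different test (for example an exact test) for Hardy-Weinberg equilibrium.

```python
import math


def marker_qc(genotypes):
    """Return (call_rate, maf, hwe_p) for one biallelic marker.

    genotypes: list of "AA", "AB", "BB", or None for a missing call
    (a hypothetical coding chosen for this sketch).
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes")

    call_rate = n / len(genotypes)

    n_aa = called.count("AA")
    n_ab = called.count("AB")
    n_bb = called.count("BB")

    # Frequency of allele A, then the minor allele frequency.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    maf = min(p, q)

    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    return call_rate, maf, hwe_p
```

A marker would then be dropped if, say, its call rate or minor allele frequency fell below a threshold you choose, or if its Hardy-Weinberg p-value were very small; the thresholds themselves depend on your study.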
To follow along you will need to download and unzip the following file, which includes several datasets:
We hope you enjoy the experience and look forward to your feedback.
1. Generating LD Plots
The general workflow outlined in this tutorial is intended to emulate a study in which one first performs a whole-genome scan on individual markers, then homes in on significant regions for a more in-depth investigation of LD and haplotypes.
Open the Project
- Launch Golden Helix SVS and choose File > Open Project.
- Navigate to the LD and Haplotypes Analysis.ghp file downloaded previously and click Open.
You’ll notice a couple of datasets already created in the project, including a joined spreadsheet of phenotype and genotype data for the HapMap samples (Phenotype Dataset + 500K Genotypes), as well as an association test results spreadsheet (Association Tests (Genotypic Tests)).
Generate a –log10 P-value plot
- Open the Association Tests (Genotypic Tests) spreadsheet.
- Right-click on the Chi-Squared –log10 P column (2) and select Plot Variable in GenomeBrowse.
A p-value plot is created. Notice that there are two regions of significance, one on Chromosome 14 and the other on Chromosome 22. In this part of the tutorial we will focus on the Chromosome 22 region.
- Before you move on, go back to the Project Navigator and rename the plot node just created to -log10 P + LD. (To do this, right-click on the node and select Rename Node.)
- Now, from the –log10 P + LD plot, copy and paste 22:37,284,796-37,342,082 into the Region text box (the text box that now says “1: 1 – MT: 16,569”) located at the top of the plot window. Press Enter on your keyboard to complete the “zoom”.
You should now be zoomed into a region on 22q12.3 (Figure 1-1).
Add LD Plot
You can add an LD plot to an existing graph from any spreadsheet that contains marker-mapped genotype data in its columns. In this case you want to generate an LD plot from the same genotype spreadsheet used to produce the association test results.
- From the -log10 P + LD plot, select File > Plot… and click the Project button (if it is not already selected). Select the Phenotype Dataset + 500K Genotypes – Sheet 1 spreadsheet and choose LD.
- Make sure the screen looks like Figure 1-2 and then click Plot & Close.
An LD plot will now appear above the p-value plot and an LD node will appear in the Plot Tree (Figure 1-3).
Notice the apparent block of LD (red) in the middle of the plot interrupted by a single SNP that is uncorrelated (blue) with the other markers.
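To make the colors in an LD plot more concrete, the Python sketch below computes two common pairwise LD measures, D' and r-squared, for a pair of biallelic markers. It is an illustration only, not SVS code: it assumes phased haplotypes are already available as two-character strings (alleles A/a at the first marker and B/b at the second), whereas SVS works from unphased genotypes and estimates the haplotype frequencies itself. The function name pairwise_ld is hypothetical.

```python
def pairwise_ld(haplotypes):
    """Return (d_prime, r_squared) for two biallelic markers.

    haplotypes: list of phased two-marker haplotypes such as
    "AB", "Ab", "aB", "ab" (a coding assumed for this sketch).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    p_a = sum(h[0] == "A" for h in haplotypes) / n   # freq. of A at marker 1
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # freq. of B at marker 2
    p_ab = sum(h == "AB" for h in haplotypes) / n    # freq. of haplotype AB

    # Raw disequilibrium coefficient.
    d = p_ab - p_a * p_b

    # Largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d_prime, r_squared
```

For example, a sample of 50 "AB", 25 "aB" and 25 "ab" haplotypes gives D' = 1 but r-squared = 1/3, which is why the two measures can color the same marker pair differently.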
2. Computing Haplotype Blocks
In SVS 8 you can compute haplotype blocks manually via the LD interface or automatically using the Gabriel et al. method. This tutorial will lead you through a combination of both.
Automatically Computing Haplotype Blocks
- From the -log10 P + LD plot select the LD item in the Plot Tree, and under the Marker Blocks tab in the Controls box, select Visible Blocks from the Compute options.
NOTE: Selecting Blocks would compute haplotype blocks across the entire 500K dataset.
Notice that the top of the Haplotype Block Detection window shows how many markers, on how many chromosomes, will be used to compute haplotype blocks. In this case blocks will be computed for 22 active markers on 1 chromosome.
- Use the default parameters and click Run.
The algorithm produces two haplotype blocks which appear as black outlined pentagons at the top of the LD plot (Figure 2-1).
One could argue there should only be one block instead of two. For this reason, SVS makes it easy to manually manipulate blocks when needed and then save the block definitions for subsequent analyses.
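The question of whether a run of markers forms one block or two comes down to the block criterion. The Python sketch below shows the general shape of a Gabriel-style rule under stated assumptions: it takes precomputed confidence bounds on D' for each marker pair, labels each pair, and calls the region a block when a large enough fraction of the informative pairs show strong LD. The thresholds (0.70, 0.98, 0.90 and 0.95) are the values commonly cited for the published method and are treated here as assumptions; the SVS defaults and implementation details may differ, and the function names are hypothetical.

```python
def classify_pair(ci_low, ci_high,
                  strong_low=0.70, strong_high=0.98, recomb_high=0.90):
    """Label one marker pair from the confidence bounds on its D'.

    Threshold defaults are assumptions taken from commonly cited values,
    not from SVS.
    """
    if ci_low >= strong_low and ci_high >= strong_high:
        return "strong_ld"
    if ci_high < recomb_high:
        return "recombination"
    return "uninformative"


def is_block(pair_bounds, min_fraction=0.95):
    """Return True if the marker pairs in a region qualify as one block.

    pair_bounds: iterable of (ci_low, ci_high) tuples, one per marker pair
    in the candidate region.
    """
    labels = [classify_pair(low, high) for low, high in pair_bounds]
    informative = [label for label in labels if label != "uninformative"]
    if not informative:
        return False
    strong = informative.count("strong_ld")
    return strong / len(informative) >= min_fraction
```

Under a rule of this kind, a single marker that is uncorrelated with its neighbors can contribute enough "recombination" pairs to split what looks like one block into two, which is consistent with the result seen here.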
Manually Manipulating Haplotype Blocks
In this step you will manually define a single block from two separate blocks.
- Click inside the larger block. This will change the outline to green, and details for this block will appear in the Console window.
- While your mouse cursor is over the left edge, press and hold down your left mouse button. Then drag the cursor to the left, expanding the larger block over the smaller block. Release the mouse button and a new block will be created. (See Figure 2-2.)
NOTE: You can generate haplotype frequencies for the selected block by clicking the option to Compute Haplotype Tables under the Marker Blocks tab of the Controls dialog.
3. Comparing Multiple LD Plots
Comparing multiple LD plots can be useful in your own study for understanding how the correlation structure in one dataset compares with that of a similar dataset, e.g. comparing a random set of Caucasians in your study with the CEU samples of HapMap. Another useful example is comparing a less dense array (e.g. Affymetrix 500K) with a denser array (e.g. Affymetrix 6.0).
Though not a standard practice, for demonstration purposes this tutorial compares the overall LD structure of all HapMap populations with that of only Yorubans.
- From the -log10 P + LD plot viewer, right-click on LD (either the LD item in the Plot Tree or the title itself in the upper-left corner of the LD plot) and select Edit Title…. Enter LD – All Populations.
- Open the Phenotype Dataset + 500K Genotypes – Sheet 1 spreadsheet, right-click on the Ethnicity column (4) and select Activate by Category. Highlight YRI and click OK. This will inactivate all non-YRI samples and create Phenotype Dataset + 500K Genotypes – Sheet 2.
- Then, in the -log10 P + LD plot viewer, go to File > Plot…, click the Project button (if it is not already selected), select the Phenotype Dataset + 500K Genotypes – Sheet 2 spreadsheet, then select the LD option. Click Plot & Close.
- A second LD plot should appear on top, pushing the original LD plot and the other plots farther down. Right-click on the second LD plot’s title and select Edit Title…. Set the new title to “LD – YRI Population”.
- Zoom into the region around the block defined in the LD – All Populations plot (Figure 3-1).
In this instance there is a slight difference in LD structure displayed in the two plots. If you observed this in your own data, you would want to investigate why such a difference exists.
- Go ahead and delete the LD – YRI Population LD plot by right-clicking its associated node in the Plot Tree and selecting Delete.
4. Haplotype Frequency Tables
Once you define a given haplotype block you can investigate haplotype and diplotype frequency estimates for both the entire population – broken down by cases and controls if applicable – and each individual sample in the dataset.
Generating Frequency Tables
- In the –log10 P + LD plot, select the LD – All Populations item in the Plot Tree and click the Compute Haplotype Tables button on the Marker Blocks tab.
- Keep the default values, except make sure that Per sample EM, Per sample diplotype, and Overall haplotype frequencies are selected. Click Run.
This will create three tables, one for each feature selected in the Haplotype Tables dialog.
The Block #2 – Haplotype Table contains overall haplotype frequencies for the entire sample set. Notice that only the first marker is listed in the row label column along with the various alleles represented in the haplotype.
- To see all the SNPs in the haplotype block go to the Project Navigator and select the Block #2 – Haplotype Table node. All SNPs are listed in the Node Change Log in addition to other summary statistics (Figure 4-1).
The Block #2 – EM Frequencies Table displays the various genotypes for each sample and their respective frequency estimates for each haplotype calculated with the EM algorithm.
The Block #2 – Diplotype Table displays each sample’s haplotype pair, combined as diplotypes, and each diplotype’s respective frequency estimates.
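The frequency tables above rely on the EM algorithm to resolve haplotypes that cannot be read directly from unphased genotypes. The Python sketch below illustrates the idea for the smallest possible case, two biallelic markers, where only the double heterozygote is ambiguous. It is a simplified illustration rather than the SVS implementation: the function name em_two_snp, the 0/1/2 genotype coding, the fixed iteration count, and the restriction to two markers with no missing data are all assumptions of this example.

```python
def em_two_snp(genotypes, n_iter=50):
    """Estimate two-marker haplotype frequencies by EM.

    genotypes: list of (g1, g2) tuples, each entry the count (0, 1 or 2)
    of the alternate allele at that marker (a coding assumed here).
    Returns a dict with keys "00", "01", "10", "11", where "01" means
    reference allele at marker 1 and alternate allele at marker 2.
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")

    # Start from equal haplotype frequencies.
    freqs = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}

    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: the double heterozygote is either 00/11 or 01/10;
                # weight each resolution by its current probability.
                cis = freqs["00"] * freqs["11"]
                trans = freqs["01"] * freqs["10"]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts["00"] += w
                counts["11"] += w
                counts["01"] += 1.0 - w
                counts["10"] += 1.0 - w
            else:
                # At most one marker is heterozygous, so phase is known.
                alleles1 = [0, 1] if g1 == 1 else [g1 // 2] * 2
                alleles2 = [0, 1] if g2 == 1 else [g2 // 2] * 2
                for a, b in zip(alleles1, alleles2):
                    counts[f"{a}{b}"] += 1.0

        # M-step: new frequencies are expected counts over 2N chromosomes.
        n_chrom = 2 * len(genotypes)
        freqs = {hap: c / n_chrom for hap, c in counts.items()}

    return freqs
```

The weight w computed for each double heterozygote plays the same role as the per-sample frequency estimates in the EM Frequencies Table: it is the estimated probability of one particular haplotype pair for that sample.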
5. Haplotype Association Tests
Golden Helix SVS provides three methods for association testing on haplotypes:
- Per haplotype association tests with a Case/Control phenotype,
- Per haplotype block association tests with a Case/Control phenotype, and
- Haplotype Trend Regression with either a Case/Control phenotype or a quantitative phenotype.
The first two methods will be discussed in this section and the following section of this tutorial, while the third method, Haplotype Trend Regression, will be discussed in Section 7 of this tutorial.
Performing Per Haplotype Association Tests
- Open the –log10 P + LD plot and select the defined block by clicking inside the block on the LD plot. The block boundary will turn green to indicate it has been selected.
- Click the Selected Block button for Subset options on the Marker Blocks tab.
This creates a subset spreadsheet (Phenotype Dataset + 500K Genotypes – Marker Block Subset) containing only the markers in the block. The subset does not include the phenotype data, so you will need to rejoin it before proceeding to association testing.
- Open the Phenotype Dataset + 500K Genotypes – Sheet 1 spreadsheet, go to Select > Column > Inactivate All Columns, then reactivate the first 4 columns by left-clicking once on each column header.
- Go to File > Join or Merge Spreadsheets and select Phenotype Dataset + 500K Genotypes – Marker Block Subset and click OK.
- On the join dialog change the New dataset name: to Phenotype + Marker Block Subset, leave all other options as their defaults and click OK.
- In Phenotype + Marker Block Subset – Sheet 1, set the C/C phenotype as dependent by clicking once on the column header, turning it magenta. Then, select Genotype > Haplotype Association Tests.
The Haplotype Association Tests window appears with a number of parameter settings. Set the parameters as follows:
- In this case we are treating all markers in the subset spreadsheet as a single block. Thus, under Haplotype Block Definition, select Use all markers as single block. Also check Show marker names in output at the bottom of this box.
- Under Haplotype Association Tests, select Calculate per haplotype.
- Under Tests select Chi-squared test and Odds ratio with 95% CI.
- Under Multiple Testing Correction check only Bonferroni adjustment (on N covariates).
- Under Additional Outputs, check Haplotype frequencies and Output data for P-P/Q-Q plots.
- Click Run to finish.
A single spreadsheet (Haplotype Association Tests (Per Haplotype)) is produced with a row for each haplotype and a column for each test statistic selected. Notice again that only the first marker of the block is represented in the row label column. However, in this case, because we selected Show marker names in output, the entire list of markers used is also output (in column 2).
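The statistics selected above can be illustrated with a short Python sketch for a single haplotype. It builds a 2 x 2 table of chromosome counts (this haplotype versus all others, in cases versus controls) and computes a chi-squared p-value, an odds ratio with a 95% confidence interval, and a Bonferroni-adjusted p-value. This is a sketch under assumptions, not the SVS implementation: the function name, the log-based (Woolf) confidence interval, and the use of whole-number counts rather than EM-estimated fractional counts are choices made for this example, and all four cells must be non-zero.

```python
import math


def haplotype_2x2_test(case_with, case_without,
                       ctrl_with, ctrl_without, n_tests=1):
    """Test one haplotype against all others in cases versus controls.

    Arguments are chromosome counts; all four must be greater than zero.
    n_tests is the number of tests used for the Bonferroni adjustment.
    """
    a, b, c, d = case_with, case_without, ctrl_with, ctrl_without
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    n = a + b + c + d

    # Pearson chi-squared statistic for a 2 x 2 table (1 degree of freedom).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))

    # Odds ratio with a 95% confidence interval on the log scale.
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

    # Bonferroni adjustment: multiply by the number of tests, cap at 1.
    p_bonferroni = min(1.0, p_value * n_tests)

    return {
        "chi2": chi2,
        "p_value": p_value,
        "odds_ratio": odds_ratio,
        "ci_95": (ci_low, ci_high),
        "p_bonferroni": p_bonferroni,
    }
```

For instance, haplotype_2x2_test(30, 70, 15, 85) describes a haplotype carried on 30 of 100 case chromosomes and 15 of 100 control chromosomes.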
Performing Per Block Haplotype Association Tests
Another, perhaps more informative, test of association is a per-block test, which uses a 2 × N chi-squared table, where N is the number of haplotypes represented.
- Again, open the Phenotype + Marker Block Subset – Sheet 1 spreadsheet and select Genotype > Haplotype Association Tests.
- Leave all the parameters the same except this time select Calculate per block under Haplotype Association Tests. Also uncheck Show marker names in output under Haplotype Block Definition.
- Click Run to finish.
A new spreadsheet is created (Haplotype Association Tests (Per Block)) with a single row of data representing per-block association results.
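To show what a per-block test computes, the Python sketch below evaluates the Pearson chi-squared statistic for a 2 x N table with one row for cases, one row for controls, and one column per haplotype. It is an illustration only: the function name and the use of plain counts are assumptions, and SVS may handle rare haplotypes and EM-estimated counts differently. The p-value would come from the upper tail of a chi-squared distribution with the returned degrees of freedom.

```python
def chi_square_2xn(case_counts, control_counts):
    """Return (statistic, degrees_of_freedom) for a 2 x N table.

    case_counts and control_counts list, haplotype by haplotype, how many
    case and control chromosomes carry each haplotype.
    """
    if len(case_counts) != len(control_counts):
        raise ValueError("both rows need one count per haplotype")
    total_case = sum(case_counts)
    total_ctrl = sum(control_counts)
    if total_case == 0 or total_ctrl == 0:
        raise ValueError("each row must contain at least one chromosome")
    grand_total = total_case + total_ctrl

    statistic = 0.0
    n_columns = 0
    for n_case, n_ctrl in zip(case_counts, control_counts):
        column_total = n_case + n_ctrl
        if column_total == 0:
            continue  # haplotype not observed at all; skip the column
        n_columns += 1
        for observed, row_total in ((n_case, total_case),
                                    (n_ctrl, total_ctrl)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    # (rows - 1) * (columns - 1) with two rows.
    return statistic, n_columns - 1
```

With N = 2 this reduces to the ordinary 2 x 2 chi-squared test, so a block with only two haplotypes gives the same statistic per block as per haplotype.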
6. Large-Scale Haplotype Association Testing
Now that you know how haplotype association testing works on a single haplotype block, you may want to investigate haplotypes on a larger, multi-block scale. For the sake of computation time, this tutorial will lead you through haplotype association on chromosome 22 only. This workflow, however, can be applied directly to the entire genome.
Calculating Haplotype Blocks for Chromosome 22
- Open the Phenotype Dataset + 500K Genotypes – Sheet 1 spreadsheet and select Select > Activate by Chromosomes.
- Click Uncheck All and then check 22 and click OK.
This will create a new spreadsheet (Phenotype Dataset + 500K Genotypes – Sheet 4) where only genotypes in Chromosome 22 are active along with the phenotype data.
- Rename this spreadsheet (using the Project Navigator) to Chr22 Genotypes.
- From Chr22 Genotypes select Genotype > Haplotype Block Detection.
- Keep the defaults and click Run.
A new block definition spreadsheet will be created (Haplotype blocks, 1362 markers in 542 groups) with a single column representing various markers and the blocks to which they belong (Figure 6-1).
Haplotype Association Tests Using Block Definitions
- Open Chr22 Genotypes and select Genotype > Haplotype Association Tests.
- Under Haplotype Block Definition select Use precomputed blocks. Also check Show marker names in output.
- Click Select Sheet. Select the Haplotype blocks, 1362 markers in 542 groups block definition spreadsheet and click OK.
- Keep the rest of the parameters the same as before (make sure Calculate per block is selected) and click Run.
A new p-value spreadsheet will be created, Haplotype Association Tests (Per Block), this time with results for each haplotype block defined across chromosome 22.
- In the Project Navigator, rename this spreadsheet to Haplotype Association Tests (Per Block) – Chr22.
Plotting Haplotype and Single Marker Association Results Together
To see if haplotypes provide additional power in association testing, you can compare haplotype association results side-by-side with single marker association results.
- Open the -log10 P + LD plot and select the first Chi-Squared -log10 P node in the Plot Tree. Under the Add tab select Add Plot Item(s).
- Click the Project button (if it is not already selected), select the Haplotype Association Tests (Per Block) – Chr22 spreadsheet, then select Chi-Squared -log10 P and click Plot & Close.
- In the Plot Tree, rename the first Chi-Squared –log10 P graph item (by right-clicking it, then selecting Edit Title) to Haplotype –log10 P. Rename the second Chi-Squared –log10 P graph item to Single Marker –log10 P.
You can change the attributes of the Haplotype –log10 P graph item to differentiate it more from the Single Marker –log10 P graph item.
- Select the Haplotype –log10 P graph item and under the Display tab change the Connector from None to Drop Line. Increase the weight (the number in the box to the right of Drop Line) to 3.
- Under the Style tab change the color to green and the symbol size (the number in the box to the right) to 5.
- Zoom into the region surrounding the peak on chromosome 22 by copying and pasting 22:36,372,272-37,995,325 into the location bar at the top of the plot window. The result is shown in Figure 6-2.
You can add the generated block set to the LD plot.
- From the Plot Tree, select the LD – All Populations plot, and under the Marker Blocks tab click Load under the Blocks options.
- Select the Haplotype blocks, 1362 markers in 542 groups spreadsheet and click OK.
You now have a p-value plot with single marker and haplotype association results along with an LD plot of Chromosome 22 with automatically defined haplotype blocks.
You can zoom in to any region by left-clicking and dragging in either graph.
- Press the left mouse button on the p-value plot’s x-axis on one side of the significant peak in chromosome 22 and, holding the mouse button down, drag to the other side of the significant peak (Figure 6-3).
7. Haplotype Trend Regression
As stated in the beginning of Section 5 of this tutorial, Haplotype Trend Regression is able to perform association testing (specifically regression) of haplotypes with either a Case/Control phenotype or a quantitative phenotype.
Haplotype Trend Regression (HTR) takes one or more block(s) of genotypic markers and, for each block of markers, estimates the haplotypes for these markers, then regresses their by-sample haplotype probabilities against a dependent variable.
Just as with haplotype association testing, Haplotype Trend Regression may be used on a small scale (such as with one haplotype block) or on a large scale (such as with a genome-wide sliding window).
We will now exercise this feature using haplotype blocks detected on one chromosome, which will be regressed against a quantitative phenotype.
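The core idea of regressing a phenotype on by-sample haplotype probabilities can be sketched in a few lines of Python. The example below is deliberately reduced to a single haplotype: each sample has one predictor, the expected number of copies of that haplotype (between 0 and 2), and the phenotype is fitted by ordinary least squares. This is an assumption-laden simplification rather than the SVS method, which fits all haplotypes of a block together and reports a full-model p-value; the function name and inputs are hypothetical.

```python
def regress_on_haplotype(dosages, phenotype):
    """Least-squares fit of phenotype = intercept + slope * dosage.

    dosages: per-sample expected copies (0 to 2) of one haplotype, e.g.
    derived from per-sample EM haplotype probabilities.
    phenotype: per-sample quantitative phenotype values.
    Returns (slope, intercept, r_squared).
    """
    n = len(dosages)
    if n != len(phenotype) or n < 3:
        raise ValueError("need matching inputs with at least 3 samples")

    mean_x = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("haplotype dosage does not vary across samples")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, phenotype))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Share of phenotypic variance explained by the haplotype dosage.
    rss = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(dosages, phenotype))
    tss = sum((y - mean_y) ** 2 for y in phenotype)
    r_squared = 1.0 - rss / tss if tss > 0 else 0.0
    return slope, intercept, r_squared
```

Because the predictors are probabilities rather than hard haplotype calls, uncertainty in phase is carried into the regression instead of being discarded.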
Full Model Regression
- Open the Phenotype Dataset + 500K Genotypes – Sheet 1 spreadsheet and left-click once on the C/C phenotype to inactivate the column. Now set the quantitative variable Pheno as dependent.
NOTE: For this simulated phenotype, performing a Corr/Trend Association test using an Additive model would show genome-wide significance for several markers in Chromosome 14. Thus, for the purpose of saving time, we will only look at markers in Chromosome 14.
- Go to Select > Activate by Chromosome, click Uncheck All, then check 14 and click OK.
- We will first compute our haplotype blocks for Chromosome 14 to use in the analysis by selecting Genotype > Haplotype Block Detection. Leave the defaults and click Run.
- Then, from the Phenotype Dataset + 500K Genotypes – Sheet 5 spreadsheet, go to Genotype > Haplotype Trend Regression.
- Under Haplotype Block Definition select Use precomputed blocks and select the Haplotype blocks, 4105 markers in 1537 groups block definition spreadsheet. Make sure Show marker names in spreadsheet output is checked and leave the rest of the settings at their default values (Figure 7-1). Click Run.
A Haplotype Trend Regression Results spreadsheet is produced. Its rows correspond to the haplotype blocks used, and each row label is the first marker in the block. The names of the markers comprising each block are shown in Column 2.
- Plot the results by right-clicking on the -log10 Full-Model P column and selecting Plot Variable in GenomeBrowse. Double-click on Chromosome 14 to zoom in on it.
- Zoom into the area around the most significant block (first marker SNP_A-1859412) and add the LD Plot. (Add the LD plot by selecting File > Plot…, clicking the Project button (if it is not already selected), choosing the Phenotype Dataset + 500K Genotypes – Sheet 5 spreadsheet, checking LD and clicking Plot & Close.)
- Add in the computed marker blocks by selecting the LD node in the Plot Tree, then under the Marker Blocks tab select Load under the Blocks options. Select the Haplotype blocks, 4105 markers in 1537 groups block definition spreadsheet. The plot window should look similar to Figure 7-3.
Note that in the two haplotype blocks having SNP_A-1859412 and SNP_A-4204238 as their first markers, respectively, there is high linkage disequilibrium among the markers within each of these haplotype blocks. Since these blocks correspond to the two most significant p-values for this haplotype regression analysis, we conclude that the haplotypes within these blocks are clearly defined and have a very significant effect on the phenotype being analyzed.
Full vs Reduced Model Regression
Haplotype Trend Regression can also be used while correcting for any covariates. This tutorial dataset includes some potential covariates, namely Ethnicity and Gender.
- Open Phenotype Dataset + 500K Genotypes – Sheet 5 and once again select Genotype > Haplotype Trend Regression.
- Under Haplotype Block Definition select Use precomputed blocks and choose the Haplotype blocks, 4105 markers in 1537 groups spreadsheet.
- Under Fixed Covariates, select Add Covariate and choose the Ethnicity column. Click Add then Close.
- Choose the Use as the reduced model for a full-vs.-reduced regression option under Fixed Covariate Options.
- Make sure Show marker names in spreadsheet output is checked and leave the rest of the settings at their defaults (Figure 7-4). Click Run.
Now we will add these results to the previous plot to compare.
- Open the Plot of Column -log10 Full-Model P from Haplotype Trend Regression Results and select the first -log10 Full-Model P node in the Plot Tree. Then on the Add tab click Add Plot Item(s). Select the second Haplotype Trend Regression Results spreadsheet and check -log10 FvR Model P to be added to the plot. Click Plot & Close.
- Change the color of the new points by selecting the -log10 FvR Model P node in the Plot Tree and under the Style tab clicking the blue square and changing it to green.
You will see that the same two blocks are still significant after correcting for ethnicity, but not quite as significant as before (Figure 7-5).
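A full-versus-reduced comparison of this kind is typically summarized, for a quantitative phenotype fitted by linear regression, with an F statistic that asks how much the residual sum of squares drops when the haplotype terms are added to the covariate-only model. The Python sketch below computes that statistic from the two residual sums of squares. It is a generic textbook formula offered as an illustration, with hypothetical names; it is an assumption that it matches what SVS reports in the FvR columns, and the p-value would come from the upper tail of an F distribution with the returned degrees of freedom.

```python
def full_vs_reduced_f(rss_reduced, rss_full,
                      n_samples, n_params_reduced, n_params_full):
    """Return (f_statistic, df_numerator, df_denominator).

    rss_reduced: residual sum of squares of the covariate-only model.
    rss_full: residual sum of squares of the covariates + haplotypes model.
    Parameter counts include the intercept.
    """
    df_num = n_params_full - n_params_reduced
    df_den = n_samples - n_params_full
    if df_num <= 0 or df_den <= 0:
        raise ValueError("full model must add parameters and leave "
                         "residual degrees of freedom")
    if rss_full <= 0:
        raise ValueError("full-model residual sum of squares must be positive")

    # Improvement in fit per added parameter, relative to the
    # residual variance of the full model.
    f_statistic = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_statistic, df_num, df_den
```

When a covariate such as Ethnicity already explains part of the phenotype, rss_reduced is smaller to begin with, so the haplotypes have less left to explain and the resulting significance is lower, in line with the comparison shown in the plot.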