Welcome to the SVS SNP Genome-Wide Association Study (GWAS) Tutorial!
Updated: February 22, 2021
Version: 8.8.3 or higher
The following tutorial is designed to systematically introduce you to a number of techniques for genome-wide association studies. It is not meant to replicate all the workflows you might use in a complete analysis, but instead touch on a sampling of the more typical scenarios you may come across in your own studies.
The genotype data included is a portion of a public GWAS dataset from the Gene Expression Omnibus database, as well as 270 HapMap samples. There is a population spreadsheet that identifies the HapMap subpopulation and the study data. Both studies were run on the Affymetrix 500K genotyping array. All phenotype data is simulated.
To follow along you will need to download and unzip the following file:
We hope you enjoy the experience and look forward to your feedback.
NOTE: All open windows (except the Project Navigator) can be closed after each section’s completion.
1. Sample QA I: Basics
A. Open Project
For genetic association analysis in SVS you will need, at minimum, three types of data: phenotype data, genotype data, and a genetic marker map for the genotype data. Additional preparatory steps ensure that the analysis runs correctly for your study.
You need to have a project open before you can import data or perform analyses.
- From the Welcome Screen select File > Open Project. Locate the Golden Helix project file SNP_GWAS_Tutorial.ghp and select it, then click Open. This will open the Project Navigator.
- Select Tools > Current Project’s Options and confirm that the correct default genome assembly is selected. For the data in this tutorial, the genome assembly should be set to Homo sapiens (Human), GRCh37 (hg19) (Feb 2009). This option primarily affects the genomic view and which data sources are available in the integrated GenomeBrowse plotting interface.
There should be four dataset nodes visible in the project: Phenotype, Population, 500K Geno Training Data, and 500K Geno HapMap Data.
- Double-click on 500K Geno Training Data – Sheet 1. The spreadsheet contains the genotype information for the study sample data. There are total of 565 study samples with 498,784 SNPs as indicated in the top right portion of the window (rows x columns).
- Click on the green map button in the top-left corner of the spreadsheet to view the Marker Map information and other marker information included with the Marker Map file (Figure 1-1).
The following sections lead you through quality assurance procedures performed in genome-wide association studies to identify samples of poor quality (call rates, heterozygosity, etc.) and those whose identity is of question (mismatched gender, ethnicity different than intended for the study, cryptically related, etc.). In some cases we’ll automatically filter samples; in others we’ll just identify samples for possible exclusion.
All of the sample statistics can be calculated with one tool. In the first step you will create a master sample statistics spreadsheet. Then you will use this spreadsheet to filter out samples based on a variety of undesirable characteristics.
B. Calculate Sample Statistics
Use the Genotype Statistics by Sample tool to calculate sample call rates and heterozygosity rates over the entire genome and over autosomes only.
- in the 500K Geno Training Data – Sheet 1 choose Genotype > Genotype Statistics by Sample.
- Under Gender Chromosome Statistics, check Gender inference: and choose X from the dropdown. Leave the default threshold value. Under Additional Outputs (Verbose Output), check Output count and variant statistics for each autosomal chromosome. The window should match Figure 1-2. Click Run.
Two output spreadsheets are created containing the statistics. The first spreadsheet, Statistics by Sample, contains the call rate and the X-chromosome heterozygosity rate. The second spreadsheet, Autosome Statistics by Sample, contains the statistics calculated separately for each autosome.
C. Filtering Samples with Low Call Rate
First filter out samples with low call rates. The sample call rate is defined as the fraction of called SNPs in the sample over the total number of SNPs in the dataset. A standard quality threshold for excluding samples with a low call rate is 95%. It is important to not include the Y chromosome when calculating per sample call rates. It is up to the researcher whether or not the X chromosome should be included in call rate calculations. If the X chromosome is not to be included, use the autosome-only call rate to determine which samples should be filtered.
- Open the Statistics by Sample spreadsheet and right-click on the Call Rate (Autosomes) column header (Column 9). Select Activate by Threshold and activate all samples with Threshold Value: >= 0.95. (Leave the other options in the Activate by Threshold dialog box at their default values.) Click OK.
In the top-right corner, the number of active rows is listed at 468 (out of 565 total). This means 97 samples had an autosomal call rate < 95%.
- Create a row subset spreadsheet by going to Select > Row > Row Subset Spreadsheet. Rename the resulting spreadsheet by right-clicking on the area at the bottom of the spreadsheet window and selecting Rename. Change the name to: Samples with Call Rate >= 0.95.
Now use the renamed spreadsheet to compare X chromosome heterozygosity rates.
D. Genotype Gender Check
Use the X chromosome heterozygosity rate to identify those samples whose inferred gender does not agree with their reported gender. Since the reported gender information is contained in the phenotype spreadsheet, you will first need to join the sample statistics spreadsheet with the phenotype information. Then create a histogram of the Heterozygosity rate and filter based on reported gender to detect discrepancies.
- From Samples with Call Rate >= 0.95 choose File > Join or Merge Spreadsheets. Select the Phenotype – Sheet 1 spreadsheet and click OK.
- Choose Current Spreadsheet under Spreadsheet as Child of. Leave all other parameters as the defaults and click OK.
In the combined spreadsheet, inferred gender is located in the 22nd column (Inferred Gender) and reported gender is located in the 29th column (Gender).
- Right-click on the Het Rate from All Columns (Chr. X) column header (Column 18) and choose Plot Histogram. There are two distinct distributions. The left one represents males and the right one females.
- To better visualize the distributions, in the plot viewer, click on the Graph 1 node in the Graph Control Interface and set the Bin Count parameter to 128.
- Next, click on the Het Rate from All Columns (Chr. X) node in the Graph Control Interface and select the Color tab. Select the By Variable radio button and click Select variable….
- Scroll to Gender, select it and click OK. Then click the blue box next to the Gender == ‘F’ option and change the color to Yellow. Follow a similar process to change the Gender == ‘M’ to blue. With the plot colored on reported gender, the inconsistencies between inferred and reported gender are evident. More inconsistencies can be seen by changing the opacity.
- While the Het Rate from All Columns (Chr. X) node is still selected in the Graph Control Interface, click on the Item tab and move the opacity bar to the left until it is in the center of its range.
In this example, there are four samples who have a reported gender opposite of what is characteristic according to the heterozygosity rate. We will compare the values of the two columns to identify mismatched samples.
- From the spreadsheet Samples with Call Rate >= 0.95 + Phenotype – Sheet 1, select Select > Compare and Activate by Column Agreement. Click Add Columns and select the two previously mentioned categorical columns (Inferred Gender and Gender) in the menu. Click OK.
NOTE: Be sure that both columns have a C to the left of the column name. These categorical columns use the same naming conventions, ‘M’ and ‘F’, to identify gender.
- Under Create subset spreadsheet(s) of:, check both Rows with matching data values and Rows with differing data values. This will allow you to easily examine and work with both samples that have consistent genders and samples that have inconsistent genders.
- Confirm that the options in the window match those in Figure 1-4. Click OK.
Two subset spreadsheets are created: Rows with matching values in columns Inferred Gender and Gender and Rows with differing values in columns Inferred Gender and Gender.
NOTE: In your own study, you may be able rectify the spreadsheet by verifying that the gender was simply a data entry error and not a genotyping anomaly.
E. Apply Filtering Results to Genotype Spreadsheet
Now to exclude the filtered samples from the genotype dataset. To do this, use the rows in the Rows with matching values in columns Inferred Gender and Gender spreadsheet to activate their corresponding rows in the 500K Geno Training Data spreadsheet.
- Open the Rows with matching values in columns Inferred Gender and Gender spreadsheet and choose Select > Apply Current Selection to Second Spreadsheet.
- Choose to Apply filtered rows to the following spreadsheet 500K Geno Training Data – Sheet 1, then click OK.
Now, you will notice, by looking in the upper-right portion of the window, that there are only 464 rows active out of 565. Create a subset with these samples.
- Choose Select > Row > Row Subset Spreadsheet.
- Rename this spreadsheet in the Project Navigator (right click on the node and select Rename Node) to Subset – Samples with Call Rate >= .95 and Matched Gender.
NOTE: You can close all open spreadsheets from the project navigator by going to Window > Close All.
NOTE: Hiding child nodes in the project navigator that you will not be actively working with can help make the project easier to navigate. Clicking on the arrow next to 500K Geno Training Data – Sheet 1 will result in the streamlined project navigator view shown in Figure 1-5.
2. Sample QA II: Cryptic Relatedness
Next, we will find and filter samples determined to be “related” to other samples. Relatedness is often defined as family-relatedness, but identity by descent (IBD) estimation can also detect duplicate samples, duplicate samples that occur from one of a pair of genotyping chips but not the other, or sample contamination. Before doing IBD, standard practice is to first prune the SNPs in LD with one another, reducing the number of association tests performed and thus the effect of multiple testing.
- Open Subset – Samples with Call Rate >= .95 and Matched Gender.
- Choose Genotype > Quality Assurance and Utilities > LD Pruning.
- Input 100 for Window Size, 5 for Window Increment, r^2 for LD Statistic, 0.5 for LD Threshold, and CHM for LD Computation Method. Click OK.
This will take a few minutes. Upon finishing, a message will appear stating that 286,502 markers have been inactivated. These markers are designated by the grayed-out columns in the spreadsheet. You can also see in the upper-right portion of the spreadsheet window that only 212,282 columns are active out of 498,784 total columns.
To perform IBD, you will be using only the active columns (non-correlated markers) for autosomal chromosomes, so first create a column subset spreadsheet and then inactivate the X chromosome before continuing.
- Choose Select > Column > Column Subset Spreadsheet.
- Then from the subset spreadsheet, choose Select > Activate by Chromosomes, uncheck the X chromosome and click OK.
NOTE: In datasets with other non-autosomal chromosomes, inactivate these as well.
- In the Project Navigator rename this node to Pruned SNP Subset.
Now you are ready for IBD estimation.
- From the Pruned SNP Subset spreadsheet, choose Genotype > Quality Assurance and Utilities > Identity by Descent Estimation.
- For this exercise:
- Uncheck Output IBS distances ((IBS 2 + 0.5*IBS 1)/# non-missing markers).
- Uncheck Output untransformed estimates of P(Z=0), P(Z=1), and P(Z=2).
- Make sure that Output PI = P(Z=1)/2 + P(Z=2) is checked.
- Make sure that Output transformed estimates P*(Z=0), P*(Z=1), and P*(Z=2) is unchecked.
- Check Output all pairs where PI >= ___ and enter 0 in this blank.
- Click Run.
This will take a few minutes. Upon completion, two spreadsheets are output: IBD Estimate: Estimated PI and Pairwise IBD Estimates (PI >= 0).
IBD Estimate: Estimated PI gives an N x N table where N is the number of samples in the dataset. By plotting a heatmap of this table we can detect patterns showing relatedness. Pairwise IBD Estimates (PI >= 0) outputs various IBD statistics for all samples. These values can be sorted or plotted to find samples related to one another or detect sample contamination.
- From the IBD Estimate: Estimated PI, choose Plot > Heat Map (Uniform).
You’ll get the plot in Figure 2-1.
By default, the heat map has a three color scheme calculated automatically. We want to define the color scheme manually based on a two color scheme where we look for sample pairs with a PI estimate of 0.25 or greater (PI of 0.25 = second degree relatives, 0.5 = first degree relatives, and 1 = identical twins (or duplicate samples)).
- Click on the IBD Estimate: Estimated PI node in the Graph Control Interface (upper-left portion of the window).
- On the Color tab choose Manual and then right-click the first option (0), and select Delete.
- Right click on the new first parameter, select Edit, and change it to 0.2. Then click once on the color box next to the second parameter and select red. Click OK. Your plot should look similar to Figure 2-2.
You can begin to see samples along the diagonal line that appear to be related to one another. You can zoom in to see this more clearly by clicking and dragging a red box in the plot area (Figure 2-3).
In most population-based studies, you’ll typically find a couple of sample pairs that are cryptically related in one way or another, which you can subsequently remove. In this particular study there are known to be several family trios, and the cluster pattern above indicates this. For the purpose of this tutorial we won’t exclude any sample pairs here, but if this were your own study, you would need to decide which member(s) of the trio to keep or discard from the study.
If you need to exclude samples from your study, the Pairwise IBD Estimates (PI >= 0) output can be used. Open the spreadsheet and right-click on the Estimated PI column (Column 7) and select Sort Descending. Those sample pairs that are highly related will be listed at the top of the spreadsheet. You can then manually exclude either one or both samples of each pair from your dataset.
3. Sample QA III: Population Stratification
The next step is to identify samples that depart from the expected homogenous ethnicity of your study. You can do this by performing principal component analysis on your data and comparing the first two principal components against reference samples of known ethnicities.
There are several ways to perform PCA. Some recommend using the pruned set of SNPs (as was done with IBD in this tutorial), some recommend using a filtered set of SNPs (on minor allele frequency and HWE for example), and some recommend using the entire SNP set. There are advantages to each. Here, we’ll use the pruned set of SNPs we created earlier. First, we need to append the HapMap samples to our study samples.
- Open the Pruned SNP Subset spreadsheet and choose File > Append Spreadsheets.
- Choose the 500K Geno HapMap Data – Sheet 1 and click OK.
- In the Append Spreadsheet window enter Pruned SNPs + HapMap as the New Dataset Name, keep the rest of the parameters as the defaults and click OK.
The new spreadsheet should have 734 rows and 207,706 columns.
NOTE: PCA should not be run on non-autosomal chromosomes. However, in our data, this issue has already been handled, because we inactivated the X Chromosome to create the Pruned SNP Subset spreadsheet, and thus this chromosome was automatically dropped during the append process we just performed.
Now run PCA on this spreadsheet.
- From the spreadsheet, select Genotype > Genotype Principal Component Analysis.
- Under Principal Components, enter 5 for Find up to top ___ components.
- Leave the defaults for the rest of the options and click Run.
Two spreadsheets, the Principal Components (Additive Model) spreadsheet and the PC Eigenvalues (Additive Model) spreadsheet, result from the analysis. To find out how many principal components are required to explain the majority of the population stratification, a PCA plot will be created and the eigenvalues will be visually inspected.
Look at the PC Eigenvalues (Additive Model) spreadsheet and notice that there is very little change between the third, fourth and fifth eigenvalues, and that these are much smaller than either of the first two eigenvalues. This implies that two principal components explain the majority of stratification in the SNP data. This is consistent with there being two different dimensions of stratification between the three major populations in the data: CEU (European), YRI (African), and CHB+JPT (Asian). We shall visualize the population stratification by plotting these first two principal components against one another, juxtaposing the HapMap samples with the GEO study data.
First, you need to join the Population spreadsheet with the Principal Components spreadsheet.
- Open the Population – Sheet 1 spreadsheet and select File > Join or Merge Spreadsheets.
- Select the Principal Components (Additive Model) spreadsheet.
- In the Join or Merge Spreadsheets window select the Current spreadsheet radio button under Spreadsheet as Child of, leave the rest of the parameters as defaults and then click OK. The combined spreadsheet should look like Figure 3-1.
It is now possible to plot one component against the other and color-code each sample or data point according to its respective ethnicity.
- From the Population + Principal Components (Additive Model) – Sheet 1 spreadsheet, select Plot > XY Scatter Plots.
The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column (principal component) to represent the independent or X axis. The list view on the right is for selecting a single or multiple columns (principal components) to represent the dependent or Y axis.
- In the left list box select EV = 35.8391. In the right list box check EV = 15.5366 and click Plot.
If we color each data point according to its respective ethnicity, the clusters become more obvious.
- In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 15.5366.
- Select the Color tab and select the By Variable radio button
- Click Select Variable… and select Population from the list. Click OK.
The four population groups are separated by color in the plot (Figure 3-2). We can see that our study consisted mostly of Caucasians (largest cluster), but also had Asians and some Asian-Americans and African-Americans. Four separate graph items, one for each Population category, are displayed in the Graph Control Interface, making it easy to change the name, color and symbol for each.
- When finished, close the Plot Viewer and rename its associated node in the Project Navigator (under the Population + Principal Components (Additive Model) – Sheet 1 spreadsheet node) to PCA Plot.
NOTE: If it is important to exclude or keep samples based on whether or not they are in a population cluster, we recommend using the Numeric > Multidimensional Outlier Detection feature for doing this, rather than visual inspection of the PCA plots. The threshold of which samples are in a population cluster can be ambiguous if it is based on visual inspection alone.
NOTE: To use the Numeric > Multidimensional Outlier Detection feature to exclude samples outside of a population cluster, calculate the inter-quartile range (IQR) distance around the centroid (median) of the study population cluster and exclude those samples that are 1.5 IQRs from the third quartile. (See Multidimensional Outlier Detection for examples of this process.)
In summary, samples were filtered out if they had a call rate below 95% or their reported and genotypically inferred genders did not match. We will not use these samples for the rest of the tutorial. We have also identified other samples that could be excluded for cryptic relatedness–however, we will still use these other samples for the rest of the tutorial. Lastly, we have identified population stratification in the dataset.
Next, we will filter SNPs based on standard SNP quality assurance metrics.
4. SNP Quality Assurance
Because we will be using a case-control phenotype, and because this phenotype will be involved with one of the SNP-wise quality assurance measures, we will merge the phenotype and genotype information now. (Specifically, we will be filtering on Hardy-Weinberg equilibrium (HWE), and that will be based on controls only.)
NOTE: If your phenotype is quantitative, you can still merge the phenotype and genotype information at this point–you simply will not need the phenotype information until you actually start performing association testing.
- Open the Phenotype – Sheet 1 spreadsheet and choose File > Join or Merge Spreadsheets.
- Select the 500K Geno Training Data – Sheet 2 spreadsheet and click OK.
- Leave all the parameters in the Join or Merge Spreadsheet window as the defaults and click OK.
As before, we will exclude the X chromosome from the following analysis. Also, we will specify case/control status, because, as stated above, we will only calculate Hardy-Weinberg Equilibrium (HWE) based on control samples.
- Choose Select > Activate by Chromosomes, uncheck the X chromosome, and click OK.
- Next left-click the Phenotype 1 column label header. This will turn the column magenta, denoting the column as the dependent variable.
NOTE: If your phenotype is quantitative, and you are following this procedure, do not left-click the header of your phenotype column (yet).
The spreadsheet should look like Figure 4-1.
- From the joined spreadsheet choose Genotype > Genotype Filtering by Marker.
The Genotype Filtering window lets you simultaneously choose thresholds for multiple statistics to filter out SNPs failing to meet respective quality assurance measures.
- Check the following options and enter the following thresholds:
- Drop if Call Rate < 0.9
- Drop if Minor Allele Frequency (MAF) < 0.01
- Perform HWE filtering based on: Controls
- Drop if Fisher’s exact test for HWE P-Value < 1e-4
- Make sure Inactivate genotype columns that meet above criteria for filtering is checked.
- Make sure Output spreadsheet with marker statistics and ‘Drop?’ columns is checked.
- Make sure Classify alleles by allele frequency is selected at the top of the dialog.
- Make sure all other boxes are unchecked.
- Click Run.
NOTE: If your phenotype is quantitative and you are following this procedure, you will have to leave Perform HWE filtering based on: set to All, because you won’t be able to change it. This is fine. Keep all the other parameters the same.
Upon completion, SNPs in the Phenotype + 500K Geno Training Data – Sheet 1 not meeting the specified thresholds will be inactivated. A new spreadsheet, Filtering Results, will also be output with the various marker statistics for each SNP.
- To see how many SNPs were filtered, go to the Project Navigator and select the Phenotype + 500K Geno Training Data – Sheet 1 spreadsheet. At the end of the Node Change Log will be an individual log noting, among other things, how many SNPs were filtered out (columns set to inactive).
Assuming all steps were followed correctly to this point, 104,035 SNPs should have been filtered out. Though any further analyses only take active columns and rows into consideration, it is often preferred to first create a subset spreadsheet of only those that are active.
- From the Phenotype + 500K Geno Training Data – Sheet 1 spreadsheet, go to Select > Subset Active Data.
The new spreadsheet, Phenotype + 500K Geno Training Data – Active Subset, should have 464 rows and 384,262 columns.
NOTE: If your phenotype is quantitative and you are following this procedure, now will be a good time to make it dependent. Inside this new subset spreadsheet, left-click on the name header of your quantitative phenotype, turning it magenta.
- Rename this spreadsheet in the Project Navigator to Filtered Data for Association Testing, then add a color tag to the node by right-clicking and selecting Tag Node Green. Colored tags are useful for easily locating specific spreadsheets.
You should now have a filtered set of samples and SNPs for association testing. Your project should look similar to Figure 4-2.
5. Genotype Association Analysis
After assuring the quality of the data, association testing can be performed.
A. Genotype Association Testing
- Open the Filtered Data for Association Testing spreadsheet and make sure the Phenotype 1 Binary column header is still set as the dependent variable.
- Choose Genotype > Genotype Association Tests.
- Make sure that, above the Association Test Parameters tab, Classify alleles by allele frequency is selected.
- In the Association Test Parameters tab, under Genetic Model or Tests, make sure the Additive Model: (dd) -> (Dd) -> (DD) radio button is selected.
- Under Test Statistic or Method, check only Correlation/Trend test and Exact form of Cochran-Armitage test. Leave the other tests unchecked.
- Under Multiple Testing Correction, make sure Bonferroni Adjustment is checked and uncheck the other options.
- Check Output data for P-P/Q-Q Plots.
- Make sure Correct for stratification with PCA is unchecked.
- In the Overall Marker Statistics tab, check Genotype counts under Count Tables and make sure all other options in this tab are unchecked. Click Run.
- Upon completion, a new spreadsheet is created, Association Tests (Additive Model). This spreadsheet displays several association statistics for each SNP (Figure 5-1).
When a SNP has a significant p-value according to the Correlation/Trend test, but the contingency table of Case Status by Number of Minor Alleles shows at least one count less than 5, the results from the Exact Cochran-Armitage Test should also be examined, since the assumptions of the Correlation/Trend test have been violated for those cases.
- Right-click on the Corr/Trend -log10 P column and select Sort Descending to bring the results for the SNPs that have the most significant P-values (according to the Correlation/Trend Test) to the top of the spreadsheet display.
Let us view the contingency table for the most significant marker (according to Corr/Trend), which is SNP_A-2070191 .
- Scroll the results spreadsheet out to Column 15. Columns 15 through 28 have all the data to reconstruct the contingency table for each marker. (Ignore Columns 29 through 31, since, for the additive model, association testing always drops samples for which the marker has missing data.)
We see that the contingency table for marker SNP_A-2070191 is as follows:
This results in an Exact Armitage P-Value of 1.869e-7. We see that this is not that much different from the Corr/Trend result (of 2.4997e-7) because all cell counts in this contingency table are 5 or higher.
We also see that this marker appears to have the most significant p-value according to the Exact Armitage test (as well as according to the Correlation/Trend Test).
- To check this, right click on the header of Exact Armitage -log10 P (Column 10) and click Sort Descending. The spreadsheet row for marker SNP_A-2070191 will remain on top.
B. Generating Q-Q Plots
Q-Q plots are generated by plotting the expected chi-squared values against the observed chi-squared values.
- From Association Tests (Additive Model), select Plot > XY Scatter Plots. Two list views will appear.
- Select Corr/Trend expected X^2 (seventh down) in the left list box and Corr/Trend X^2 (fourth down) in the right list box.
- Click Plot.
- In the Graph Control Interface, Select Graph 1.
- Under the Add Item tab select f(x) = m(x) + b and click Add. This will generate a straight line with a slope of 1 and y-intercept of 0.
You should have a Q-Q plot that looks like Figure 5-2.
NOTE: If you want to change the weight and color of the straight line, you can, in the Graph Control Interface, select its associated graph item (f(x) = 1.0(x) + 0.0) and, in the Item tab to the right of Line, choose the color and weight you would like for the line.
- When you’ve finished, close the Plot Viewer and rename its associated node in the Project Navigator to Q-Q Plot.
Similarly, you could also plot a P-P plot for the Corr/Trend results by using Corr/Trend expected -log10 P on the X axis and Corr/Trend -log10 P on the Y axis, or plot a P-P plot for the Exact Armitage results by using Exact Armitage expected -log10 P on the X axis and Exact Armitage -log10 P on the Y axis.
(For this example, it turns out that the Q-Q plot and both of these P-P plots are quite similar.)
C. Generating P-Value Plots
- From the Association Tests (Additive Model) spreadsheet, right-click on the header of the Corr/Trend -log10 P column (Column 2) and select Plot Variable in GenomeBrowse.
Notice the full-domain view now has chromosome bands and the X-axis is represented by chromosome and physical position (Figure 5-3).
Note that the most significant p-values are in Chromosome 6. Let us zoom in for a closer look.
- There are many ways to zoom in the GenomeBrowse window:
- Double-clicking a chromosome’s diagram (or its number) in the Full Domain View (top plot).
- Manually selecting a chromosome in the chromosome selector dropdown (near the top left of the plot window).
- Manually entering a chromosome range (such as 5 – 6) in the chromosome selector dropdown (near the top left of the plot window), then clicking on the right-arrow icon just to the right of the Genomic Location Bar (at the top of the plot window above the Full Domain Band).
- Manually entering a chromosome and/or position or a gene name in the Genomic Location Bar (at the top of the plot window above the Full Domain Band).
- Using the mouse scroll wheel.
- Using the zoom slider.
- And last of all, you can zoom into any region in the plot by selecting the magnifying-glass icon (Figure 5-4) on the zoom toolbar, then pressing your left mouse button at the left edge of the region and while holding down your left mouse button dragging the cursor across to the right edge of the region, finally releasing the mouse button to complete the zoom.
- Right now, double-click the 6 in the Full Domain View.
Zooming displays the karyogram view of chromosome 6. The annotation track provides more information about the SNPs and other genotypes that are in the region on display.
- Zoom further into the peak (click and drag on the x-axis) and left-click on the top-most point in the plot (Figure 5-5).
This displays, in the Console (bottom-left pane), the marker name, its -log10 p-value, chromosome, and position, along with other marker map and spreadsheet values.
You can see in what gene(s) this SNP and others in the peak reside by looking at the annotation gene source that is loaded by default into the plot.
- Zoom in even further around the top five or seven SNPs and, if necessary, expand the boundaries around the annotation track to give it more room.
You can also find a listing of information available in each plot by activating the Feature List for that plot. Right-click anywhere on an active plot and select Feature List (Figure 5-7).
NOTE: You may find that the Feature List comes out in its own window. If you wish, you may then dock that window into the bottom of the main display (which will make the main plot window match Figure 5-7).
D. Creating a Manhattan Plot
Manhattan plots are popular images for publication purposes as they are color-coded by chromosome, making it easy to see where significant markers reside.
- First, turn off the Feature List by clicking the X in its upper right corner.
- Then, in the Chromosome selection drop-down (near the upper-left-hand corner of the plot window) select All to reset the zoom.
- Select either of the two Corr/Trend -log10 P graph items in the Plot Tree.
- Under the Style tab, select Chromosome in the Style By: drop-down.
This will split the graph into 22 different colors, one for each chromosome (Figure 5-8). You can change the color, symbol choice, and size of each chromosome by selecting its respective option in the Style options box.
E. Saving Plots as Images
You can save any displayed plot in the plot view as an image.
- Choose File > Save as Image from the plot window.
This will bring up a preview window (Figure 5-9). Here you can manipulate various image parameters.
- Click Advanced in the upper-right corner of the display (unless the advanced options are already showing).
- Make sure the Domain View option under the Advanced options is unchecked.
- (Under the Advanced options, you can also change the margins of the image.)
- Next, Browse to a folder where you want the image saved, give it the name Manhattan Plot and click Save.
- Click Save again at the bottom of the preview window to save the image.
- Once the image is saved, close the Plot Viewer and rename its associated node in the Project Navigator to Manhattan Plot.
6. Genotypic Regression and Correcting for Covariates
If you build a model and the response is correlated with one or more variables relating to the samples, those variables (“confounding variables”) can sometimes hide real associations from the results of performing a test, or at least hide the true magnitude of these associations. We will next examine how to correct for confounding variables as a part of our association testing.
We will use the Genotypic Regression feature (Genotype > Genotypic Regression Analysis) to do this. This feature is very similar to Numeric > Numeric Regression Analysis, with the exception that genotypic data may be directly used as input. An additional third tab allows selection of genotype-specific parameters.
A. The Possibly Confounding Covariate
Let us first look at the phenotype and at the possibly confounding covariate that we will be dealing with in this section of the tutorial. We will be using Phenotype 2 as the phenotype, and we suspect Gender may be a confounding covariate.
- Open the Filtered Data for Association Testing spreadsheet, right-click on Phenotype 2, and choose Plot Histogram.
- In the Graph Control Interface, click on the Graph 1 node and, in the Graph tab, set the Bin Count parameter to 32.
- Next, in the Graph Control Interface, click on the Phenotype 2 node and select the Color tab. Select the By Variable radio button in this tab and click Select variable….
- Scroll to Gender, select it and click OK. Then click the blue box next to the Gender == ‘F’ option and change the color to yellow. Follow a similar process to change the Gender == ‘M’ to blue.
- While the Phenotype 2 node is still selected in the Graph Control Interface, click on the Item tab and move the opacity bar to the left until it is in the center of its range.
We see that the peak and the average of the distribution of Phenotype 2 for males are somewhat less than the peak and the average, respectively, of the distribution of Phenotype 2 for females. This indicates that Phenotype 2 mildly correlates with Gender. Therefore, we will want to correct for Gender when testing on Phenotype 2.
B. Genotypic Regression–Full Model Only
But before that, as a baseline, let’s test on Phenotype 2 without correcting for covariates.
- Open the Filtered Data for Association Testing spreadsheet and left-click once on Phenotype 1 to inactivate the column. Now set Phenotype 2 as dependent by left-clicking once on its column header.
- Select Genotype > Genotypic Regression Analysis.
- Under Selection Parameters, select Regress on each of the 384255 genotypic columns. Make sure all other checkboxes in this tab are unchecked.
- In the Output Parameters tab, check Output data for P-P/Q-Q plots, Bonferroni adjustment (on N covariates), and False Discovery Rate (FDR). Make sure all other checkboxes in this tab are unchecked.
- In the Genotypic Parameters tab, make sure Classify Alleles by major allele (d) vs. minor allele (D) as found in the data, Additive model: DD=2, Dd=1, dd=0, and Drop samples containing missing genotype values are checked. Check Genotype counts and make sure all other boxes in this tab are unchecked.
- Finally, make sure all parameters agree with Figures 6-2, 6-3, and 6-4, then click Run.
This process won’t take that long, even for the 384,255 markers, because this is a process that uses linear regression rather than logistic regression. (Linear regression is automatically being used because Phenotype 2 is quantitative rather than case/control.)
Let us examine the results.
- First, right-click on the bottom of the results spreadsheet window and select Rename. Rename the spreadsheet to Genotypic Regression Results – Full Model Only.
- Right-click on the header of -log10 Full-Model P and select Plot Variable in GenomeBrowse.
We see significant results in Chromosome 11.
- First, go back to the navigator node corresponding to this plot and rename it to Genotypic Regression Plot.
- Now, in the Genotypic Regression Plot, zoom in to Chromosome 11 by double-clicking on it, then zoom in immediately around the cluster of SNP’s that are very significant. (There should be 17 very significant SNP’s.)
We see (Figure 6-5) that the most significant SNP has a -log10 p-value of about 10.
C. Genotypic Regression–Full vs. Reduced Model
Now, let us perform our association test while correcting for Gender.
- Open the Filtered Data for Association Testing – Sheet 2 spreadsheet and make sure Phenotype 2 is still dependent.
- Go to Genotype > Genotypic Regression Analysis.
- Under Selection Parameters, make sure Regress on each of the 384255 genotypic columns is still checked.
- Under Regress on each of the 384255 genotypic columns, check Correct for covariate(s). In the Reduced Model Covariates section, click Add Covariate. Check Gender and click Add.
- The first tab should look like Figure 6-6. The other two tabs should still look like Figures 6-3 and 6-4. Click Run.
This will take a little longer, since there are two regressions for each marker–the full model (encompassing the full-model covariate and the reduced-model covariates) and the reduced model (encompassing only the reduced-model covariates).
- Now, right-click on the bottom of this results spreadsheet window and rename it to Genotypic Regression Results – Full vs Reduced.
- Scroll this spreadsheet over to the FDR column (Column 20) and right-click on the column header of FDR. Select Sort Ascending.
Notice that there are 17 markers for which the False Discovery Rate (FDR) is less than 0.05.
- Scroll the spreadsheet back to Columns 1 and 2, FvR Model P-Value and -log10 FvR Model P.
Observe that these same 17 markers have full-vs-reduced p-values less than 6.5 x 10-7 and -log 10 full-vs-reduced p-values greater than 6. We conclude that these markers are genome-wide significant.
Let us compare these results visually with the baseline results.
- In the Genotypic Regression Plot, go to the Plot Tree and click on the first -log10 Full-Model P node. In the Add tab under Controls, click Add Plot Item(s).
- Select the Project button (if it is not already selected). Choose the Genotypic Regression Results – Full vs Reduced spreadsheet, select -log10 FvR Model P, and click Plot and Close.
- In the Plot Tree, the -log10 FvR Model P node should be selected. In the Style tab under Controls, click on the blue square and change the color to green. The result should look something like Figure 6-7.
We see that most of these 17 markers now show as being somewhat less significant after having been corrected for Gender, with one marker having a corrected p-value 10 times larger than its uncorrected p-value.
Even so, the corrected p-values for all of these 17 markers are still genome-wide significant, as we saw by examining their False Discovery Rate.
7. Regressing on Covariate/Column Interactions
We saw in the last section (6. Genotypic Regression and Correcting for Covariates) that when the most significant markers resulting from testing against Phenotype 2 were corrected for Gender, their significance lessened somewhat, simply because Phenotype 2 is somewhat correlated with Gender.
However, rumor has it (see NOTE below) that for this data, “the association signal [of Phenotype 2] is only valid for females”. If this is correct, it means that for any marker containing “the association signal”, the value of Phenotype 2 will change significantly for subjects that are both female and have a variation in the number of minor alleles in that marker. This constitutes an interaction–specifically, between a covariate (Gender) and the marker (column) currently being tested–a “Covariate/Column Interaction”.
We will now test if there is such an interaction taking place.
NOTE: If you wish, you can click on the first navigator node of the project (SNP_GWAS_Tutorial) and read the User Notes text box, which will explain how the phenotype data for this project was simulated.
A. The Covariate/Column Interaction Regression
Our original Full-Model-Only test that we have already performed on Phenotype 2 will also be the baseline model for this test.
We will run this test with Gender as an interaction covariate, rather than as a correcting covariate.
- Open the Filtered Data for Association Testing – Sheet 2 spreadsheet and make sure Phenotype 2 is still dependent.
- Go to Genotype > Genotypic Regression Analysis.
- This time, select Regress on covariate-column interactions (on 384255 genotypic cols). The upper-right box on this tab will become enabled and change its appearance slightly.
- In this upper-right box, now called Covariate-Column Interactions (Full Model), click Add Col Interaction. Check Gender and click Add. The first tab should now look like Figure 7-1.
Notice, in Figure 7-1, that the upper-right and lower-right boxes indicate what variables will be in which parts of this regression. The Full Model will encompass, along with the reduced-model covariates, the interaction of Gender:F (1 for female, 0 for male) with the current column’s data. The reduced model will encompass (along with the intercept) both Gender:F by itself and the current column’s data by itself.
- The second and third tabs (Output Parameters and Genotypic Parameters) should still look like Figures 6-3 and 6-4. Click Run.
This should take a tiny bit longer than the regression that corrected for Gender.
- Right-click on the bottom of the results spreadsheet window and rename it to Genotypic Regression Results – CCI.
- Scroll this spreadsheet over to the FDR column (this time it’s Column 26) and right-click on the column header of FDR. Select Sort Ascending.
We see that some regressions were not successful, probably because for these markers, all the genotypes for females were homozygous major-allele, and thus, for these markers, the interaction variables were all zeroes.
NOTE: You can explore the above statement for yourself for the first unsuccessful marker listed, SNP_A-2130115, as follows:
- Go to the Filtered Data for Association Testing – Sheet 2 spreadsheet and sort on Gender. Note that there are 144 females.
- Select Edit > Find and look for SNP_A-2130115.
- Note that all the genotypes in the first 144 rows are G_G except for one missing value (?_?) genotype.
- In the additive genetic model (which is being used here–see Figure 6-4), G_G corresponds to zero, since G is the major allele. So the interaction for every subject, male or female, will either be zero because of being G_G, because of being male, or both.
Let us look at the successful results.
- In the results spreadsheet we have been examining (Genotypic Regression Results – CCI), scroll down to and past row 84.
We see that there are 17 markers with excellent False Discovery Rate values (less than 0.0002) (with 16 of these having FDR less than .00001) and 67 more with good False Discovery Rate values. This is promising.
- Scroll the spreadsheet horizontally to Column 1.
We see that all the p-values of these 84 markers look excellent. Let’s see how the best 17 stack up against the p-values from our other tests.
- Go back to the Genotypic Regression Plot, and in this window, go to the Plot Tree and click on the first -log10 Full-Model P node. In the Add tab under Controls, click Add Plot Item(s).
- Select the Project button (if it is not already selected). Choose the Genotypic Regression Results – CCI spreadsheet, select -log10 FvR Model P, and click Plot and Close.
- In the Plot Tree, the first -log10 FvR Model P node should already be selected. In the Style tab under Controls, click on the blue square and change the color to pink. The result should look something like Figure 7-2.
We see that the interactions are obtaining some of the best significance that we have found so far for any markers. (As a side note, we see that the significance for a number of other (less-significant) markers becomes much greater when testing them for interaction with Gender.)
- Now, in the Chromosome selection drop-down (near the upper-left-hand corner of the plot window) select All to reset the zoom.
We see that there are no markers that, by testing their interactions with Gender, become more significant than the interactions with Gender of the 17 markers we have been studying in detail.
B. Other Possible Studies
You have now performed a cursory genome-wide association study on a case/control phenotype and a second genome-wide association study using regression on a quantitative phenotype. You may wish to try out other, possibly more challenging, analyses, such as running other association tests, running other variations of regression, or testing on the other phenotypes.
As mentioned in the NOTE at the beginning of this section, more information about each phenotype in this tutorial project can be found by clicking on the first node (SNP_GWAS_Tutorial) in the Project Navigator and reading the User Notes text box.