Welcome to the Bovine GWAS with Mixed Linear Model Tools!
Updated: January 9, 2019
Version: 8.7.0 or higher
The following tutorial is designed to systematically introduce you to a number of techniques for genome-wide association studies. The dataset included in the tutorial is from a Bovine HapMap dataset genotyped using an Illumina Array. For the association study we will be using the Mixed Linear Model tool that is available in SVS.
This tutorial is not meant to replicate all the workflows you might use in a complete analysis, but instead touch on a sampling of the more typical scenarios you may come across in your own studies.
All phenotype data is simulated.
To follow along you will need to download and unzip the following file:
We hope you enjoy the experience and look forward to your feedback.
1. Sample and Marker Quality Assurance
In the following sections we will filter the data removing samples with poor call rate as well as filter markers for call rate, number of alleles, and minor allele frequency. Then we will finish up with LD Pruning on the data in preparation for cryptic relatedness and population stratification checks.
- From the SVS Welcome screen go to File > Open Project and navigate to the downloaded project and select the Bovine_GWAS_Tutorial.ghp file. The project should already contain two spreadsheets, Phenotypes and Bovine HapMap Genotypes.
- Select Tools > Current Project’s Options and confirm that the correct genome assembly is selected. The project for this tutorial should have already been pre-set to use the Bos taurus (Cow), ARS UCD1.2 (Apr 2018) genome assembly. This option primarily affects the genomic view and which data sources are available in the integrated GenomeBrowse plotting interface.
NOTE: For more details on creating and using genome assemblies in SVS, please see the Creating and Using Genome Assemblies in SVS Tutorial.
- Open the Bovine HapMap Genotypes spreadsheet and choose Genotype > Quality Assurance and Utilities > Filter Samples by Call Rate.
- Select to Drop if call rate <= 0.99. Dialog should look like Figure 1-1.
Four samples should have been inactivated due to poor call rate, and a subset spreadsheet called Subset – Samples with Call Rate > 0.99 should have been created.
- From the Subset – Samples with Call Rate > 0.99 spreadsheet go to Genotype > Genotype Filtering by Marker.
- Select to Drop if call rate < 0.85, Drop if number of alleles > 2, and Drop if Minor Allele Frequency (MAF) < 0.01. The dialog should look like Figure 1-2. Click Run.
The results of the tool should be one spreadsheet, Filtering Results, which will contain two columns for each filtering criterion–one column with the original dataset’s values for that criterion, and the other a binary column stating whether each marker met that filter criterion. Additionally, an overall Drop? column will be output (at the beginning) that has a 1 for every marker that will be filtered from the original dataset.
In the original genotype spreadsheet you will see that now some of the markers in the dataset have been inactivated. In particular, 3,567 columns were inactivated leaving 49,323 markers active.
- Create a subset spreadsheet by going to Select > Subset Active Data.
- Rename the spreadsheet to High Quality Samples and Markers by right-clicking on the spreadsheet node in the project navigator and selecting Rename.
LD Pruning of Markers
Some tests such as Identity by Descent Estimation (IBD) and Principal Component Analysis (PCA) will obtain better results if the markers used are not in linkage disequilibrium with each other. So, before proceeding to the next sections of the tutorial we will LD Prune the markers.
- Open the High Quality Samples and Markers spreadsheet and go to Genotype > Quality Assurance and Utilities > LD Pruning.
- Leave the default options and click OK.
NOTE: The options for pruning your markers are specified in the options dialog with the default options being the most common choices for basic pruning of the data. However, if your marker data is dense in some areas or you see large blocks of moderately high LD in other areas that you would like reduced, changing these default options may improve your results. See Determining the best LD Pruning options blog for assistance in selecting the best options for your data.
This LD Pruning step will take a couple of minutes to run. Upon finishing, 5,734 markers will be inactivated as designated by the grayed out columns in the spreadsheet. You will be using only the active columns (non-correlated markers) for autosomal chromosomes to perform IBD and PCA, so create a column subset spreadsheet.
- Choose Select > Column > Column Subset Spreadsheet.
- In the Project Navigator, rename this node to Pruned SNP Subset.
- Close all open spreadsheets by going to Window > Close All.
If your dataset contains non-autosomal chromosomes, you will want to inactivate the markers in these chromosomes before proceeding to IBD and PCA analysis.
- To inactivate these markers from the subset spreadsheet (if necessary), choose Select > Activate by Chromosomes, uncheck the non-autosomal chromosomes (ex. X, Y, XY, Z, MT) and click OK.
2. Cryptic Relatedness
Next we will find and filter samples determined to be “related” to other samples. Relatedness is often defined as family-relatedness, but identity by descent (IBD) estimation can also detect duplicate samples, duplicate samples from one of a pair of genotyping chips but not the other, or sample contamination.
- From the Pruned SNP Subset spreadsheet, choose Genotype > Quality Assurance and Utilities > Identity by Descent Estimation.
- For this exercise, check Output IBS distances ((IBS 2 + 0.5*IBS 1)/# non-missing markers) and Output PI = P(Z=1)/2 + P(Z=2). Uncheck all other options, and click Run.
This will take a few minutes. Upon completion, two spreadsheets are output: IBD Estimate: Estimated PI and IBS Distance ((IBS2 + 0.5*IBS1)/# non-missing markers).
IBD Estimate: Estimated PI gives an N x N table where N is the number of samples in the dataset. By plotting a heatmap of this table we can detect patterns showing relatedness. IBS Distance ((IBS2 + 0.5*IBS1)/# non-missing markers) is also an N x N table that reflects the Identity by State (IBS) between pairs of samples, and is the recommended kinship matrix to be used with the Mixed Linear Model Analysis tool we will be using later in this tutorial.
- From the IBD Estimate: Estimated PI spreadsheet, choose Plot > Heat Map (Uniform).
You’ll get the plot in Figure 2-1.
By default, the heat map has a three color scheme calculated automatically. We want to define the color scheme manually based on a two color scheme where we look for sample pairs with a PI estimate of 0.25 or greater (PI of 0.25 = second degree relatives, 0.5 = first degree relatives, and 1 = identical twins (or duplicate samples)).
- Click on the IBD Estimate: Estimated PI node in the Graph Control Interface (upper-left portion of the window).
- On the Color tab choose Manual and then right-click the first option (0), and select Delete.
- Now right click on the first parameter, select Edit, and change it to 0.2. Your plot should look like Figure 2-2.
You can see several rectangular areas in the plot that show a high degree of relatedness in those samples. You can zoom in to see this more clearly by clicking and dragging a red box around one of these rectangular areas (Figure 2-3).
This dataset contains samples from 19 different cattle breeds. Pairs of samples from within the same breed should be more closely related to each other then a sample from one breed and a sample from a different breed. This is confirmed by the fact that the green rectangles you see along the diagonal in Figure 2-2 are areas where both samples are within the same breed.
Also, you might expect that pairs of samples within some breeds would be more highly related than pairs of samples within other breeds. The fact that some rectangles are darker green than others confirms that this is the case for this dataset.
You can see a breakdown of the breed proportions for this dataset by opening the phenotype spreadsheet and creating a pie chart from the breed variable.
- Open the Phenotypes spreadsheet.
- Right-click on the Breed column and select Pie Chart.
3. Population Stratification
The next step is to identify samples that depart from the expected homogenous ethnicity of your study. You can do this by performing principal component analysis on your data and comparing the first two principal components.
There are several ways to perform PCA. Some recommend using the pruned set of SNPs (as was done with IBD), some recommend using a filtered set of SNPs (on minor allele frequency and HWE for example), and some recommend using the entire SNP set. There are advantages to each. Here we’ll use the pruned set of SNPs we created earlier.
- Open the Pruned SNP Subset spreadsheet and select Genotype > Genotype Principal Component Analysis.
- Under Principal Components, enter 10 for Find up to top ___ components.
- Leave the defaults for the rest of the options and click Run.
Two spreadsheets result from this analysis, namely, the Principal Components (Additive Model) spreadsheet and the PC Eigenvalues (Additive Model) spreadsheet. To find out how many principal components are required to explain the majority of the population stratification, a PCA plot will be created and the eigenvalues will be visually inspected.
Look at the PC Eigenvalues (Additive Model) spreadsheet. Notice that there is very little change between the seventh, eighth and ninth eigenvalues, implying that about 6 principal components explain the majority of stratification in the SNP data.
You can visualize the population stratification by plotting the first few principal components against one another. We suspect that the breed structure in the sample will explain the stratification we are seeing. To confirm this assumption, we will be coloring the PCA plot by breed. To be able to color by breed we will need to first join the Phenotypes spreadsheet with the Principal Components spreadsheet.
- Open the Phenotypes spreadsheet and select File > Join or Merge Spreadsheets.
- Select the Principal Components (Additive Model) spreadsheet.
- In the Join or Merge Spreadsheet window select the Current spreadsheet radio button under Spreadsheet as Child of, leave the rest of the parameters as their defaults, and click OK. The combined spreadsheet should look like Figure 3-1.
It is now possible to plot one component against the other and color-code each sample or data point according to its respective breed.
- From the Phenotypes + Principal Components (Additive Model) – Sheet 1 spreadsheet, select Plot >XY Scatter Plots.
The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column (principal component) to represent the independent or X axis. The list view on the right is for selecting a single or multiple columns (principal components) to represent the dependent or Y axis.
- In the left list box select EV = 31.2575. In the right list box check EV = 9.00414 and click Plot.
If we color each data point according to its respective breed, the clusters become more obvious.
- In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 9.00414.
- Select the Color tab and select the By Variable radio button
- Click Select Variable and select Breed from the list. Click OK.
You can see in the plot (Figure 3-2) that there are about 5 to 6 different clusters of samples, depending on whether you believe the Romagnola breed is on its own or a part of the larger group that includes the Piedmontese breed. This confirms our assumption when examining the eigenvalues that about 6 principal components explain the majority of the stratification in the data.
- When finished, close the Plot Viewer and rename its associated node (under the Phenotypes spreadsheet) in the Project Navigator to PCA Plot.
4. Preparing Data and Running Analysis
Preparing the Dataset
To begin analysis we must first join together all the necessary spreadsheets.
- Open the High Quality Samples and Markers spreadsheet and reactivate all markers by going to Select > Activate All.
- This will create a new spreadsheet called High Quality Samples and Markers – Sheet 2. From this spreadsheet go to File > Join or Merge Spreadsheets and select the Phenotypes spreadsheet.
- Set the New dataset name: to Phenotype + High Quality Data and choose Project root under the Spreadsheet as Child of option. Leave the rest of the options as default and click OK.
Running the Analysis
- Open the Phenotype + High Quality Data spreadsheet and left-click the Phenotype1 column label header. This will turn the column magenta denoting the column as the dependent variable.
- Choose Genotype > Mixed Linear Model Analysis.
Through the quality assurance parts of this tutorial, we have learned that the dataset has a lot of cryptic relatedness, as well as population stratification due to the breed structure. For this type of data, the Mixed Model approach is a perfect fit, as we can adjust for relatedness with the random effects component of the model using a kinship Matrix (IBS), and also include additional fixed effects (Breed) using the covariates options.
- On the MLM Parameters tab, check the Mixed Model GWAS group box and within these options select both Single-locus mixed model GWAS (EMMAX) and Multi-locus mixed model GWAS (MLMM) analysis options.
- Check the Use Pre-Computed Kinship Matrix (Cov. Matrix of Random Effects), and then click Select Sheet. Choose the IBS Distance… spreadsheet to be used as the kinship matrix.
NOTE: For this tutorial we have selected to use the IBS matrix for our kinship component, as that is what was recommend by the authors of the EMMAX method. Other analysis tools may recommend using the IBD matrix or even a GRM computed from another source (ex. GBLUP Genomic Relationship Matrix). The authors of the EMMAX method described why they believe the IBS matrix is the best choice for their method in a paper available on PubMed here.
- Check Correct for Additional Covariates and then Add Columns. Choose the Breed column from the spreadsheet.
- Under the Additional Outputs tab, make sure the False discovery rate (FDR), Output data for P-P/Q-Q Plots, Output -log10(P), and Allele Frequencies options are checked, and uncheck all other options.
- Once the dialog looks like Figures 4-1 and 4-2, click OK.
It will take a few minutes to finish as the Multi-locus portion needs to go through several iterations of the EMMAX algorithm. Once it is finished you should get four resulting outputs from the analysis. We will look at each of these in the following sections.
5. Single-Locus Mixed Model Analysis
The P-Values from Single-Locus Mixed Model output is from a single EMMAX run that tested each active marker in the dataset. Significant p-values can be plotted in a Manhattan plot, and Q-Q Plots can be created from the expected values to test the fit of the model.
- To create a Manhattan Plot right-click on the -log10(P-Value) column and select Plot Variable in GenomeBrowse.
- In the Plot Tree window, click either of the two -log10(P-Value) plot nodes, and under the Style tab of the Controls window select Style By: Chromosome.
You can zoom into the peak visible in Chromosome 17 by selecting the magnifying glass icon on the zoom toolbar, then pressing your left mouse button to the left of the peak, and while holding down your left mouse button, dragging the cursor across to the right of the peak, finally releasing the mouse button to complete the zoom. Repeat this process until you have the visibility you prefer.
As soon as you get in far enough you should start to see the names of the markers become visible in the plot. For this dataset we have two markers (HapMap57042-rs29016514 and ARS-BFGL-BAC-36625) that are highly significant in this region, with many supporting markers around them.
Since these markers are both behaving very similarly, we suspect that they are in high LD with one another. To test out this theory, we can add an LD plot directly above the p-value plot.
- Click the Plot button in the upper left corner of the GenomeBrowse window.
- On the Add Data Sources dialog that appears, select the Project option on the left and the Bovine HapMap Genotypes spreadsheet in the Navigator Window Nodes. Finally, check LD under the Plot Data options.
- Once the dialog looks like Figure 5-2, click Plot & Close to add the LD Plot to your existing GenomeBrowse window.
As we expected, the LD plot shows a large red area around our markers of interest indicating that they are in high LD with each other.
- Clicking on the dark red rectangle on the LD Plot below the two markers of interest will show the computed LD value for those markers in the Console window.
- To test the fit of the model return to the P-Values from Single-Locus Mixed Model spreadsheet and choose Plot > XY Scatter Plots. Select -log10(Expected P) on the left side and -log10(P-value) on the right side, then click Plot.
- The y = x line can be added to the plot by clicking the Graph1 node in the Graph Control Interface, then on the Add Item tab check the f(x) = mx + b item and click Add.
6. Multi-Locus Mixed Model Analysis
We will use the next three outputs created, which were P-Values from the Multi-Locus Mixed Model, MLMM Step Information and Covariate P-values, and Variance Partition Plot, to examine the Multi-Locus portion of the analysis.
MLMM Step Information and Covariate P-values
For complex traits controlled by several large-effect loci, a single-locus test may not be appropriate, especially in the presence of population structure. Therefore, the MLMM process is available, which is a simple stepwise mixed-model regression that works as follows:
- Begin with an initial model that includes, as its fixed effects, only the intercept and any additional covariates you may have specified.
- Using this model, perform an EMMAX scan through all markers (that you have not specified as additional covariates).
- From the markers scanned above, select the most significant marker and add it to the model as a new fixed-effect covariate (“cofactor”), creating a new model.
- Repeat (2) and (3) (forward inclusion) until a pre-specified maximum number of forward steps is reached or until forward inclusion must be stopped for some other reason (see NOTE below).
- For each selected marker in the current model, temporarily remove it from the fixed effects and perform an EMMAX scan over only that marker.
- Eliminate, from the current model, the marker that came out as least significant using the above test. A new smaller model is created.
- Repeat (5) and (6) (backward elimination) until only one selected marker is left.
The variance components are re-estimated between each forward and backward step, while the same kinship matrix is used throughout the calculations.
NOTE: The following are reasons why forward inclusion might be stopped early:
- When the pseudo-heritability estimate has become close to zero.
- When all of the variance is explained by the fixed-effect covariate markers (cofactors) currently in the model and any covariates you may have specified.
- When the latest significant marker that has been selected is collinear with the fixed-effect covariate markers (cofactors) already in the model and any covariates you may have specified.
- When the model now requires more active samples for analysis than exist in the spreadsheet.
- When the number of active (non-cofactor) markers is no longer enough to permit analyzing the current model.
Let us analyze our Multi-Locus outputs.
- Open the MLMM Step Information and Covariate P-values spreadsheet to see the statistics for each of the steps that were run on this dataset.
The result of this stepwise regression is a series of models. Several model criteria have been explored by the authors of this method to determine how appropriate any of the models is. See Model Criteria for a full description of each criterion.
- Examining the Optimal according to: column for this dataset you can see that the After Step 1 results are most appropriate according to several model criteria.
- The variance of the phenotype for the model resulting from each step of the analysis is summarized in the Variance Partition Plot, as well as in the MLMM Step Information and Covariate P-values spreadsheet.
P-Values from Multi-Locus Mixed Model
From the MLMM Step Information and Covariate P-values output we learned that the After Step 1 portion of the output is optimal.
- Open the P-Values from Multi-Locus Mixed Model spreadsheet and horizontally scroll to the group of columns that begin with P-Value (After Step 1) and end with Manhattan Category (After Step 1). For this example this group includes columns 10 through 18. If more output options had been selected on the Additional Output tab of the analysis dialog, there would have been additional columns for each group.
- Right-click on the -log10(P-Value) column (11) and select Plot Variable in GenomeBrowse.
To color the plot in Manhattan Plot fashion we will use Manhattan Category (After Step 1) (Column 18).
- Click either -log10(P-Value) node in the Plot Tree, and on the Style tab of the Controls dialog pick Manhattan Category (After Step 1).
The difference between using this categorical data and just coloring using a standard Manhattan color schema is that the SNPs added as covariates to the model (“cofactors”) are colored differently from the rest of the markers in the same chromosome, so that they can easily be differentiated. As seen in Figure 6-4 below, marker Hapmap57042-rs29016514 is the covariate that was added to the Step 1 model.
NOTE: You may wish to scroll down to Cofactor in the Style: box and increase the plot point size of Cofactor to 10 or 12 to better distinguish cofactor plot points, since the Cofactor plot point color otherwise may be hard to distinguish from the plot point color for some of the chromosomes.
We now want to examine the results of the analysis to determine if any marker other than Hapmap57042-rs29016514 shows significance, since the object of this method is to look for several large-effect markers working together that affect the trait.
- Open the P-Values from Multi-Locus Mixed Model output and right-click the P-Value (After Step 1) column and select Sort Ascending.
We see from Figure 6-5 that aside from our covariate marker, there are several markers that are marginally significant (p-value less than 1e-4), but that none of them pass multiple testing correction based on the False Discovery Rate (FDR) (Column 16).