Welcome to the SVS Runs of Homozygosity Analysis Tutorial!
Updated: January 28, 2021
Version: 8.7.0 or higher
Dr. Todd Lencz, Associate Director of Research at The Zucker Hillside Hospital, working in collaboration with Dr. Christophe Lambert of Golden Helix, has developed a novel analytic approach that first identifies patterned clusters of SNPs demonstrating extended homozygosity (runs of homozygosity or “ROHs”) and then employs both genome-wide and regionally-specific statistical tests for association to disease. This approach can identify chromosomal segments that may harbor rare, penetrant recessive loci.
Using a simulated dataset, this tutorial will lead you step-by-step through the workflow for finding runs of homozygosity outlined in Dr. Lencz’s paper.
This tutorial will not cover the data importing and quality control procedures Dr. Lencz employed in his study, most of which can be done with SNP & Variation Suite (SVS). To learn more about these procedures, please refer to the manual.
To complete this tutorial you will need to download the following:
We hope you enjoy the experience and look forward to your feedback.
1. Overview of the Project
- From the main SVS welcome screen, select File > Open Project. Navigate to the Runs of Homozygosity Tutorial folder you just downloaded and select the Runs of Homozygosity Tutorial.ghp file. Click Open to finish.
The project should already have three spreadsheets.
The first is the 500K HapMap – Sheet 1 with mapped genotypic information for the 270 HapMap samples (Figure 1-1).
The second spreadsheet is the Phenotype Dataset – Sheet 1. It has phenotypic information including a Case/Control column and a Population column (Figure 1-2).
The final spreadsheet, B Allele Frequencies – Transposed – Sheet 1 (Figure 1-3), contains B Allele Frequencies for each SNP in the HapMap data. This spreadsheet, unlike the genotypic and phenotypic information, something you should have before any analysis, was obtained separately by importing Affymetrix CEL files, then using the Affymetrix B Allele Frequency Calculation add-on script.
The process to calculate B Allele Frequencies is somewhat long and complicated; you can read more about the script in its documentation. Also, the CNV Tutorial provides more information about obtaining this spreadsheet from your own data. This spreadsheet will be explained further when it is used to examine loss of heterozygosity at the end of the tutorial.
Versions of ROH Available in SVS
There are two versions of the ROH tool available in SVS. One is specific to microarray SNP data, which this tutorial will run using Genotype > Runs of Homozygosity for GWAS.
There is another ROH dialog under the DNA-Seq menu, called Runs of Homozygosity for NGS. To run this feature, the spreadsheet’s marker map must contain a Reference allele field.
Below is an example of what the NGS data may look like. (This data is not included with the tutorial dataset.)
The menu for the NGS version varies slightly from the menu for the GWAS (microarray) version (which we will be using later in this tutorial).
The differences between the menus include the option of how missing genotypes are treated. With the NGS version, the options allow for treating missing genotypes as homozygous reference calls or keeping missing genotypes missing. The GWAS version always treats these as missing. Also, the options for the maximum gap between SNPs in a run and the maximum density of run are not present in the NGS version. These options are not included in that version because NGS data can be more sparse than GWAS data.
2. Identifying Runs of Homozygosity
- To identify runs of homozygosity, open 500K HapMap – Sheet 1 and select Genotype > Runs of Homozygosity for GWAS. The Runs of Homozygosity window will appear.
- Choose the Distance: radio button under Minimum run length: and enter 1500. Leave 25 in the second box as the min # SNPs:.
- Select the Allow runs to contain up to… radio button under Heterozygotes and enter 1. Check the two checkboxes under this. Enter the values 5, 10, and 5 in the boxes.
- Select the Allow runs to contain up to… radio button under Missing Genotypes and enter 5 in the box. Check the …consecutive missing genotype(s) option and enter 3 in the box.
- Check Maximum gap between SNPs in a run and enter 100.
- On the Output Options tab, select everything under Output Runs.
- Check Output Cluster of Runs for Association Studies and enter 10 for the # samples in a cluster.
Once dialogs look like Figures 2-1 and 2-2, click Run to start the algorithm.
NOTE: For sparse data, it is recommended that the minimum SNP density on the first dialog should be adjusted. This value could be computed by going to Genotype > Quality Assurance and Utilities > SNP Density.
The algorithm will begin by sweeping through the data row-wise for each sample and then internally create a binary matrix. It then runs through the binary matrix column-wise, looking at each SNP’s corresponding binary value to determine whether or not a given SNP falls in a common ROH.
Within each column the algorithm counts how many samples have a ‘1’ for each SNP. For this example, to define a cluster of SNPs that fall in ROHs, there must be at least 10 samples with a ‘1’ for each SNP, and runs may include multiple SNPs. That is, a run starts at the first SNP with 10 or more ‘1s’ in the binary matrix and extends until a SNP is found having fewer than 10 ‘1s’.
Another run would begin when another SNP is found with ten or more ‘1s’ in the binary matrix. The algorithm will also create Optimal and Haplotype Similarity clusters. These are created from the original clusters, but are meant to try to narrow down the clusters or split them apart into more narrowly defined and more accurate clusters.
Consider the following abbreviated example of five samples. Let’s say the input parameters are 10 for minimum run length and 3 for minimum # of samples. For each sample, a horizontal run of 10 or more homozygous SNPs are denoted with a ‘1’. The highlighted regions are vertical clusters of at least 3 samples with a ‘1’ in the matrix.
Having completed this matrix, the algorithm then computes the fraction of ‘1s’ within each run for every sample. The example matrix above would produce the table below as the First Column – Cluster of Runs … spreadsheet where the label ROH1 will be the first SNP label in the first ROH.
Examining the Results of the Analysis
As a result of running Runs of Homozygosity, sixteen spreadsheets are created and explained below.
- Close all of the spreadsheets from the Project navigator by selecting Window >Close All.
- Open First Column – Cluster of Runs….
The First Column – Clusters of Runs… spreadsheet (Figure 2-3) is analogous to the example table above. It contains the first SNP of each cluster where common ROHs were found. Each column represents a cluster of SNPs labeled by the first SNP’s name. Each row represents a sample from the HapMap spreadsheet. Each column in the spreadsheet shows the fraction of SNPs in the cluster that are members of common ROHs for each sample.
Compared to the Every Column – Cluster of Runs… spreadsheet, which contains every SNP in the run, this format is better for association testing as there is a reduction in the total number of tests, and therefore, fewer multiple testing corrections are required.
Notice column one, which indicates that 10 or more of the 270 samples in the dataset had a run of at least 25 SNPs of at least 1500kb in length. This cluster begins at SNP_A-2076482 and if you click on the green MAP button in the top-left corner you will see that it starts at the physical position 35100862 on Chromosome 1. Keep this spreadsheet open.
- Now open the Cluster of Runs… spreadsheet (Figure 2-4).
This spreadsheet displays the clusters identified by the previous spreadsheet. Take notice of how there are 24 rows labeled by Cluster ID. These 24 rows correspond to the 24 columns in the First Column – Cluster of Runs… spreadsheet.
This can be verified by clicking on the green Map button in the top-left corner of the First Column – Cluster of Runs… spreadsheet and comparing the chromosome and start position to each cluster in the Cluster of Runs spreadsheet.
There will also be a similar set of spreadsheets for the Optimal First and Haplotype Similarity clusters.
Comparing the First Column output between the original clusters and the Optimal clusters, there are more 1’s in the columns of the Optimal clusters, meaning the columns overlap with the included markers better than they do for the original clusters.
Comparing the two spreadsheets, the overlap seems better in the Optimal spreadsheet.
- Close all these spreadsheets.
- Open the Homozygous Runs… spreadsheet (Figure 2-6).
This spreadsheet is similar to the Cluster of Runs spreadsheet but contains details of all the ROHs found, not just the ones common among at least 10 samples. Displayed is information for the chromosome, start and end positions of the runs in terms of both physical position in the genome (Start Position) and column number from the marker mapped spreadsheet (Start Index), the run length in number of SNPs and base pairs, the number of missing genotypes in the run and the number of heterozygous markers. Each row represents one ROH and is labeled by the sample name. There may be more than one row for any sample if that sample contains more than one ROH.
- Open the Every Column – Cluster of Runs spreadsheet (Figure 2-7).
- Create a heat map of this spreadsheet by choosing GenomeBrowse >Heat Map.
This information is analogous to the First Column – Cluster of Runs spreadsheet, but has information repeated for each SNP in every cluster. This format is best for visualization of the data in a heat map.
- Within the plot select the Heat Map node in the Plot Tree. Then choose the Display tab on the Controls dialog. We will want to create a 2 color heat map.
- Uncheck the Auto-Compute Color Values option.
- Delete all three of the middle color options by right clicking and selecting Delete
- Change the value for the top option to 0 and the value for the bottom to 1 by right clicking and selecting Edit.
- Click on the top color box to change it to white and bottom color box and change it to green. This plot shows you where the clusters are located on the genome (Figure 2-8).
We can compare the optimal and original heat maps by plotting the Optimal Every Column – Clusters of Runs spreadsheet in the same GenomeBrowse window.
- Click the Plot button in the upper left corner of the GenomeBrowse window.
- Then click the Project button on the resulting dialog.
- From the project select the Optimal Every Column – Clusters of Runs spreadsheet
- Select Heat Map under the plot options
- The dialog should look similar to Figure 2-9. Click Plot & Close.
The optimal clusters will have smaller bounds than the original clusters, since this method tries to split clusters into smaller groupings that contain more similar SNPs than before.
- On the new heat map adjust the color scheme to match the Every Column… results.
- Zoom into a region of Chromosome 7 to see the differences by entering the following (7: 106,977,699 – 129,397,700) into the address bar at the top of the GenomeBrowse window.
- Close this plot and spreadsheet.
- Open the Incidence of Common Runs per SNP spreadsheet (Figure 2-11). This spreadsheet contains the number of runs that each SNP is included in, along with the column number that corresponds to the columns from the marker mapped spreadsheet. The chromosome in which the SNP is located is also included.
- Close this spreadsheet.
- Now open the last spreadsheet, Binary ROH Status… (Figure 2-12). This spreadsheet is the binarized version of the HapMap data, with a 1 corresponding to a homozygote and a 0 to a heterozygote as in the illustrative example above. This spreadsheet is also useful for heat map visualization.
- Select GenomeBrowse > Heat Map and change the color scheme as you did before.
This plot (Figure 2-13) differs from the previous heat map because it shows all homozygous runs, not just the ones found in clusters among samples. Also take notice that there is more white space along the bottom of the map (Sample index > 180).
In the HapMap data, the large sample indexes correspond to the African (YRI) population. Africans are known to be more diverse genetically and therefore contain less homozygosity on average than the other HapMap groups.
Since we are testing for association based on homozygosity, it is important to consider population stratification in your own analysis, as we will in this tutorial. Close this plot and spreadsheet.
3. Perform Regression with ROH Covariates
At this point we will do regression analysis using the fractions provided in the First Column – Cluster of Runs spreadsheet as covariates. First we need to merge this with the phenotype spreadsheet.
- Open Phenotype Dataset – Sheet 1 and select File >Join or Merge Spreadsheets.
A Navigator Window Chooser window will appear prompting you to select a spreadsheet to join with.
- Select the First Column – Clusters of Runs… spreadsheet and click OK.
- Within the Join or Merge Spreadsheets dialog, enter the New Dataset Name: Pheno + First Column – Cluster of Runs.
- Choose to create the Spreadsheet as Child of: Current Spreadsheet, leave the rest as defaults and click OK.
A new spreadsheet (Figure 3-1) will be created containing case/control status, population and the common ROH covariates.
You can now perform association using Numeric Association Analysis or Regression Analysis. We will choose the Regression Analysis in order to control for population stratification in the second step.
- Left-click the Case/Control column header to set the column as the dependent variable (magenta).
- Select Numeric >Numeric Regression Analysis and choose the radio button Regress on each of the 24 numeric columns under Selection Parameters.
- Next select the Output Parameters tab and check Output data for P-P/Q-Q plots (Output –log10(P) will be checked automatically). Your window should match Figure 3-2. Click Run.
A Regression Results spreadsheet will appear.
- Right-click on the –log10 Full-Model P column header (2) and select Plot Variable in GenomeBrowse. The resulting plot (Figure 3-3) will show association on chromosome one. As mentioned earlier, population is a possible confounding factor, so we should control for it in our analysis.
- Close the plot and the regression results spreadsheet.
- Open the Pheno + First Column – Cluster of Runs – Sheet 1 spreadsheet and select Numeric >Numeric Regression Analysis.
- Under Regress on each of the 24 numeric columns check Correct for covariate(s). The Reduced Model Covariates section will become active.
- Click on Add Covariate and select Population from the list. Click Add and then Close. Verify that your window matches Figure 3-4 and click Run.
- Now, to compare the two sets of regression results, open the previously created plot Plot of Column -log10 Full-Model P from Regression Results.
- Select File >Plot, then click the Project button. Select the second Regression Results spreadsheet and check-mark the –log10 FvR Model P. Click Plot & Close.
The plot (Figure 3-5) shows that population stratification explains some of the variation in our data. However, even after controlling for the confounding variable, we still see significant association.
4. Examine Loss of Heterozygosity
We can confirm the validity of the run detection algorithm by examining loss of heterozygosity. In fact, runs of homozygosity (calculated on genotypes) can serve as a proxy for loss of heterozygosity (calculated on B allele frequencies).
The B Allele Frequency spreadsheet in the project represents the B allele frequencies calculated for each marker by the Affymetrix B Allele Frequency Calculation script. This script first combines quantile normalized SNP A and B probe intensities for each marker into a theta value. It then calculates B allele frequencies for each marker. If we locate a run of homozygosity in the heat map containing all ROHs, we can confirm that the algorithm worked correctly by plotting the B allele frequency for the same sample that contained the run.
- Open the Heat Map of Binary ROH Status… and notice the long run on Chromosome 1. For a better view, double click on Chromosome 1 in the Full Domain View to zoom into Chromosome 1.
- Click on the long green run to find out which sample it belongs to (it may be easier to zoom into the run to ensure you click on the correct sample). The sample information will appear in the Console window in the bottom left corner. The the data console will contain some information about the sample. Make sure you’ve selected sample NA12874.
Now we can add a plot of the B allele frequencies for this sample.
- Go to File > Plot, then click the Project button and select the B Allele Frequencies – Transposed – Sheet 1.
- In the Search: box, enter the sample number you identified earlier (NA12874). Check the appropriate sample and click Plot & Close.
If you zoomed in to select the sample, zoom back out to Chromosome 1 by double clicking on it in the Full Domain View. Your plot should look like Figure 4-1.
B Allele frequencies (y-axis) close to 1 or 0 correspond to homozygous SNPs, whereas a frequency closer to .5 suggests heterozygous SNPs. Notice over the same section as the long run from the heat map (essentially the whole q arm of Chromosome 1), the B allele frequency plot contains relatively few markers in the middle. Alternatively, where there are no runs of homozygosity for this given sample (p arm of Chromosome 1), the data are reasonably distributed between 0, .5 and 1.