Step 3. Identify Runs of Homozygosity
To identify runs of homozygosity, select Genetics > Runs of Homozygosity from the marker mapped spreadsheet created in Step 2. A dialog window will open, prompting you for analysis options.
In this dialog you will need to select the Minimum run length in SNPs and the Minimum # of samples that contain a run. The minimum run length is the number of consecutive homozygous SNPs a sequence must contain to be considered a run of homozygosity (ROH). The minimum number of samples is the number of samples that must share a ROH for that run to be considered “common”. Two check boxes in the ROH dialog allow two optional spreadsheets to be created. These optional spreadsheets and the input parameters are discussed in more detail later.
For this tutorial, input 100 for Minimum run length in SNPs, 10 for Minimum # samples that must contain a run, and leave both boxes checked as in Figure 1.
Click Run.
Behind the scenes, the algorithm begins by sweeping through the data row-wise for each sample and internally creates a binary matrix as follows. With the input parameters above, the sweep searches for runs of at least 100 consecutive homozygous SNPs (the genotype does not matter; AA is treated the same as BB, and missing genotypes are considered part of a run). If such a run exists, a ‘1’ is assigned to each SNP within it. A ‘0’ is assigned to all other SNPs, that is, those that do not fall in runs of 100 or more.
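The row-wise sweep for a single sample can be sketched in Python as follows. This is only an illustration of the idea described above, not the software's actual implementation. The function name mark_runs and the genotype encoding (two-character strings such as "AA", "AB" and "BB", with "??" standing for a missing genotype) are assumptions made for this example.

```python
def mark_runs(genotypes, min_run_length):
    """Return a 0/1 list marking SNPs that lie in a run of at least
    min_run_length consecutive homozygous (or missing) genotypes."""
    MISSING = "??"  # assumed encoding for a missing genotype

    def extends_run(g):
        # Homozygous calls (AA, BB) and missing calls both extend a run.
        return g == MISSING or g[0] == g[1]

    flags = [0] * len(genotypes)
    start = None  # index where the current candidate run began

    for i, g in enumerate(genotypes):
        if extends_run(g):
            if start is None:
                start = i
        else:
            # A heterozygous call ends the candidate run.
            if start is not None and i - start >= min_run_length:
                flags[start:i] = [1] * (i - start)
            start = None

    # Close a run that reaches the last SNP.
    if start is not None and len(genotypes) - start >= min_run_length:
        flags[start:] = [1] * (len(genotypes) - start)
    return flags


sample = ["AA", "BB", "??", "AA", "AB", "BB", "AA", "AB"]
print(mark_runs(sample, 3))  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Applying this function to every sample (row) would yield the binary matrix. Note that the sketch also accepts a run made up entirely of missing genotypes; whether the software does the same is not stated here.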
The algorithm then runs through the binary matrix column-wise, looking at each SNP’s corresponding binary values to determine whether or not a given SNP falls in a common ROH. Within each column the algorithm counts how many samples have a ‘1’ for that SNP. To define a cluster of SNPs that fall in ROHs, there must be at least 10 samples with a ‘1’ for each SNP, and runs may include multiple SNPs. That is, a run starts at the first SNP with 10 or more ‘1s’ in the binary matrix and extends until a SNP is found having fewer than 10 ‘1s’. Another run begins when another SNP is found with 10 or more ‘1s’ in the binary matrix.
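The column-wise pass can be sketched in the same way. Again, this is a minimal illustration under stated assumptions rather than the software's code: the function name common_run_clusters is hypothetical, and the binary matrix is assumed to be a list of equal-length 0/1 rows, one row per sample.

```python
def common_run_clusters(binary_matrix, min_samples):
    """Return (first, last) column index pairs, inclusive and 0-based,
    for each stretch of consecutive SNPs where at least min_samples
    samples have a 1."""
    n_snps = len(binary_matrix[0]) if binary_matrix else 0
    # Number of samples with a 1 at each SNP (column sums).
    counts = [sum(row[j] for row in binary_matrix) for j in range(n_snps)]

    clusters = []
    start = None  # first column of the cluster currently being extended
    for j, count in enumerate(counts):
        if count >= min_samples:
            if start is None:
                start = j
        elif start is not None:
            clusters.append((start, j - 1))
            start = None
    if start is not None:  # cluster runs to the last SNP
        clusters.append((start, n_snps - 1))
    return clusters


matrix = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0, 0, 0],
]
print(common_run_clusters(matrix, 3))  # [(1, 3), (6, 7)]
```

In this made-up five-sample matrix, the column sums are 2, 3, 4, 3, 0, 1, 3, 4, so with a minimum of 3 samples there are two clusters.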
Example
Consider the following abbreviated example of five samples. Let’s say the input parameters are 10 for minimum run length and 3 for minimum # of samples. For each sample, a horizontal run of 10 or more homozygous SNPs is denoted with ‘1s’. The highlighted regions are vertical clusters of at least 3 samples with a ‘1’ in the matrix.


Having completed this matrix, the algorithm then computes the fraction of ‘1s’ within each run for every sample. The above example would produce the following table:
Now consider this algorithm on the data provided for this tutorial. As a result of running Runs of Homozygosity, three spreadsheets will be created.
The Cluster of Runs spreadsheet (Figure 2) is the primary output of the Runs of Homozygosity algorithm and reflects the example table above. It contains clusters of SNPs where common ROHs were found. Each column represents a cluster of SNPs, labeled by the names of its first and last SNPs, the number of SNPs found in that run (in parentheses), and the first and last SNP column numbers from the marker mapped spreadsheet. Each row represents a sample from the marker mapped spreadsheet. Each value in the spreadsheet is, for that sample, the fraction of SNPs in the cluster that are members of the sample’s own ROHs.
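The values in a table of this kind can be sketched in Python as the per-sample fraction of ‘1s’ inside each cluster. This is an illustrative sketch only: the function name cluster_fractions, the small binary matrix, and the cluster boundaries are made up for the example and are not taken from the tutorial data.

```python
def cluster_fractions(binary_matrix, clusters):
    """For every sample (row) and every cluster (first, last), inclusive
    0-based column indices, return the fraction of SNPs in the cluster
    that are flagged 1 for that sample."""
    table = []
    for row in binary_matrix:
        fractions = []
        for first, last in clusters:
            n_snps = last - first + 1
            fractions.append(sum(row[first:last + 1]) / n_snps)
        table.append(fractions)
    return table


# Hypothetical binary matrix (5 samples x 8 SNPs) and two clusters.
matrix = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0, 0, 0],
]
clusters = [(1, 3), (6, 7)]

for sample_row in cluster_fractions(matrix, clusters):
    print([round(value, 4) for value in sample_row])
```

A value of 1.0 means the sample's ROH covers the whole cluster, 0.0 means the sample has no ROH there, and values in between indicate partial overlap.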
The optional Incidence of Common Runs per SNP spreadsheet (Figure 3) displays columns for the SNP column number in the marker mapped spreadsheet, the number of runs associated with each SNP, and the chromosome number for the SNP. Row labels are the SNP names.
The optional Homozygous Runs spreadsheet (Figure 4) details the ROHs found. Displayed are data for the start and end positions of the runs in terms of both SNP name and column number from the marker mapped spreadsheet, the run length in number of SNPs, the number of missing genotypes in the run, the chromosome number, the start and end physical position on the chromosome, and the length in physical position. Each row represents one ROH and is labeled by the sample name. There may be multiple spreadsheet rows for one subject if there is more than one ROH in that sample’s data.
Open the Cluster of Runs spreadsheet (Figure 2). Notice column one, which indicates that 10 or more of the 48 samples in the dataset had a run of 100 or more SNPs overlapping the common ROH of 129 SNPs between SNP_A-4217094 and SNP_A-1790050 (columns 56490 and 56618 of the marker mapped spreadsheet).
You can see how each of the values was computed by investigating the Homozygous Runs spreadsheet. First look at Sample 0003 in the Cluster of Runs spreadsheet (row 3), whose value is ~0.9767. This indicates that its ROH of 100 or more SNPs covers 97.67% of the common ROH. To see how this was determined, open the Homozygous Runs spreadsheet, sort ascending by Patient ID (left click on the "cell" above Patient ID until the arrow points up), and scroll down to row 270 (Figure 5).
Here you can see that this particular run for Sample 0003 started at SNP_A-4193830 (column 56493 of the marker mapped spreadsheet) and ended at SNP_A-2040997 (column 56678 of the marker mapped spreadsheet), totaling 186 SNPs. This means Sample 0003’s run started three SNPs after the start of the common ROH and extended past its end, as illustrated in Figure 6.
In total it shares 126 of the 129 SNPs or ~97.67% of the common ROH.
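The arithmetic behind this value can be checked with a short Python sketch that uses the column numbers quoted above. The function name overlap_fraction is hypothetical; the calculation simply counts the SNP columns shared by the sample's run and the common ROH and divides by the length of the common ROH.

```python
def overlap_fraction(run_first, run_last, cluster_first, cluster_last):
    """Fraction of the common ROH (cluster) covered by a sample's run.
    All arguments are inclusive column numbers."""
    shared_first = max(run_first, cluster_first)
    shared_last = min(run_last, cluster_last)
    shared = max(0, shared_last - shared_first + 1)  # 0 if no overlap
    cluster_length = cluster_last - cluster_first + 1
    return shared / cluster_length


# Sample 0003: run spans columns 56493-56678 (186 SNPs);
# common ROH spans columns 56490-56618 (129 SNPs).
fraction = overlap_fraction(56493, 56678, 56490, 56618)
print(round(fraction, 4))  # 0.9767
```

The shared columns run from 56493 to 56618, which is 126 SNPs, and 126 / 129 is approximately 0.9767.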
Now that the ROHs have been identified, you can perform association testing.