In the paper Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia, Todd Lencz, Ph.D., introduced a new way of doing association testing on SNP microarray platforms. The method, which he termed "whole genome homozygosity association", first identifies patterned clusters of SNPs demonstrating extended homozygosity (runs of homozygosity, or "ROHs") and then applies both genome-wide and region-specific statistical tests to ROH clusters for association with disease. This approach differs from the single-SNP and haplotype association methods employed in most genome-wide association studies and may be better suited to identifying chromosomal segments that harbor rare, penetrant recessive loci. Another advantage is that it reduces the number of covariates tested, and therefore the multiple-testing correction required for genome-wide significance.
The original algorithms for Lencz’s study were co-developed by Christophe Lambert, Ph.D., Founder of Golden Helix, and later implemented in SNP and Variation Suite (SVS).
But, like all good science that gets adopted by the community, the original ROH method has since been extended by other papers and programs. An especially significant contribution is the addition of more advanced filtering criteria to ensure that algorithmically detected ROHs are true population variants rather than the product of random chance, genotype-calling anomalies, or low marker coverage over certain regions of the genome. This is the first area we looked at when improving our method.
Improved filtering ensures more accurate ROHs
Homozygosity, by definition, is the lack of difference between the two alleles at a given locus, one inherited from the father and one from the mother. The detrimental effects of parental relatedness on offspring fitness have been well studied in plant and animal genetics [1], including their manifestation in disorders and disease traits such as blood pressure and LDL cholesterol. Long runs of homozygosity in study populations have therefore proven informative as covariates for association analysis with phenotypes of interest.
We want to ensure, then, that detected ROHs are not chance events or artifacts of low SNP density. To do that, we first need to estimate the probability that an ROH of a given length would occur in a sample by chance. We must also take into account that ROHs are influenced by linkage disequilibrium (LD) within a chromosomal region, which in turn is shaped by recombination rates and population history. In humans, extensive studies of LD show that short ROHs should be common throughout the genome [2]. Finally, given a specific genotyping platform and the average density of its markers across the genome, we want to exclude potential ROHs that fall in regions of below-average marker density.
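To make the chance calculation concrete, here is a minimal Monte Carlo sketch (my own illustration, not the SVS implementation) that estimates how often a run of at least min_run consecutive homozygous calls appears among n_markers markers, assuming each marker is independently homozygous with probability p_hom. The independence assumption ignores LD, so real chance runs will tend to be somewhat longer than this estimate suggests:

```python
import random

def longest_homozygous_run(calls):
    """Length of the longest stretch of consecutive homozygous calls."""
    best = cur = 0
    for is_hom in calls:
        cur = cur + 1 if is_hom else 0
        best = max(best, cur)
    return best

def chance_roh_probability(n_markers, min_run, p_hom, n_sims=1_000, seed=42):
    """Monte Carlo estimate of the probability that a run of at least
    min_run homozygous markers appears by chance among n_markers
    independent markers, each homozygous with probability p_hom."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        calls = [rng.random() < p_hom for _ in range(n_markers)]
        if longest_homozygous_run(calls) >= min_run:
            hits += 1
    return hits / n_sims

# e.g. 5,000 markers with ~70% average per-SNP homozygosity:
print(chance_roh_probability(5_000, min_run=25, p_hom=0.70))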
Moving window? You don’t need no stinking moving window!
Wanting to improve our implementation of the ROH analysis method to include these new detection and filtering parameters, I naturally looked at the published works and their implementations of the algorithm. While one publication [3] set up a hidden Markov model with transition functions based on average recombination rates (yuck!), the rest (other than publications citing Golden Helix) seemed to use the PLINK algorithm.
So I first took a look at how PLINK solved the problem of satisfying these filtering constraints while examining a reasonably extensive search space. (If you think about it, an ROH could begin at any marker and have any arbitrary length, so with n markers there are on the order of n² potential runs!)
Focusing on speed, PLINK starts with a relatively small moving window (50 SNPs by default) and moves it across each sample's genotypes, determining whether the window is "homozygous enough" given user-specified allowances for missing and heterozygous calls. (PLINK documents an option to define the window in base pairs, but it doesn't seem to be implemented.) PLINK then stitches these small windows together into actual runs by looking for markers where a user-defined percentage of the overlapping windows are homozygous.
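As a rough sketch of that windowed strategy (a simplification for illustration; the parameter names are mine, and this is not PLINK's actual source), the per-window test and the marker-flagging step look something like this:

```python
def window_is_homozygous(genotypes, start, window_size, max_het, max_missing):
    """Test one fixed-size window against the het/missing allowances.
    Genotypes are coded 0 or 2 = homozygous, 1 = heterozygous, None = missing."""
    window = genotypes[start:start + window_size]
    hets = sum(1 for g in window if g == 1)
    missing = sum(1 for g in window if g is None)
    return hets <= max_het and missing <= max_missing

def flag_markers_by_windows(genotypes, window_size=50, max_het=1,
                            max_missing=5, threshold=0.05):
    """Slide the window across one sample's genotypes and flag a marker
    as 'in a run' when a sufficient fraction of the windows overlapping
    it pass the test. Note the leak: two adjacent passing windows can
    each contain a heterozygote, so runs stitched from flagged markers
    can hold more hets than any single window allowed."""
    n = len(genotypes)
    starts = range(n - window_size + 1)
    passed = [window_is_homozygous(genotypes, s, window_size,
                                   max_het, max_missing) for s in starts]
    flags = []
    for i in range(n):
        overlapping = [passed[s] for s in starts if s <= i < s + window_size]
        frac = sum(overlapping) / len(overlapping) if overlapping else 0.0
        flags.append(frac >= threshold)
    return flags
```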
This method results in a couple of undesired effects:
- ROHs detected by PLINK can have a few homozygous markers at each end of a run that are not counted.
- User-specified constraints on missing and heterozygous markers hold only for the intermediate moving windows, not for whole runs. This means you may have specified that you only want to allow 1 heterozygote, but you'll get runs with 4!
- This problem cannot be solved by increasing the window size, because then you can't detect ROHs smaller than the window.
Now, I don’t mean to imply that the design of this algorithm is trivial. The PLINK method does find ROH runs, just not as accurately and precisely as one would like. Unfortunately, reading the papers that use the PLINK algorithm, it is not apparent that the authors realize the trade-offs (and inaccuracies) of this approach.
The Golden Helix ROH algorithm, on the other hand, applies the allowed-heterozygote and allowed-missing parameters strictly to the whole run. It considers every potential run starting at every marker that meets the size and density requirements, and does not need to stitch windows together or otherwise compromise the quality of run detection, all while maintaining speed equivalent to or even better than PLINK's algorithm.
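For illustration, here is one way (a sketch of the general technique, not Golden Helix's actual code) to enforce the allowances over an entire run: a two-pointer scan that keeps running heterozygote and missing counts for the current stretch, so the constraints hold from end to end by construction:

```python
def strict_roh_scan(genotypes, min_snps=25, max_het=1, max_missing=2):
    """Find maximal runs in which the het/missing allowances hold over
    the whole run. Genotypes: 0 or 2 = homozygous, 1 = heterozygous,
    None = missing. Returns (start, end) marker-index pairs, inclusive."""
    runs, left, hets, miss = [], 0, 0, 0
    for right, g in enumerate(genotypes):
        hets += (g == 1)
        miss += (g is None)
        # Shrink from the left until the whole stretch obeys both limits.
        while hets > max_het or miss > max_missing:
            hets -= (genotypes[left] == 1)
            miss -= (genotypes[left] is None)
            left += 1
        # The stretch [left, right] now satisfies the constraints end to end.
        if right - left + 1 >= min_snps:
            runs.append((left, right))
    # Keep only maximal runs: drop any run contained in the one after it.
    return [r for i, r in enumerate(runs)
            if i + 1 == len(runs)
            or not (runs[i + 1][0] <= r[0] and r[1] <= runs[i + 1][1])]
```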
Parameter selection
With the release of SVS 7.3, we offer an updated implementation for detecting ROHs and clusters of ROHs in your study. It has proven to be faster and more accurate than other implementations, including our previous one. However, as they say, "With great power comes great responsibility," and users now have the power, and the responsibility, of tweaking quite a few parameters in the algorithm. One would not want to pick blindly or, even worse, introduce an unaccounted-for multiple-testing bias by tweaking parameters continuously to get the desired outcome. Instead, let's walk through how to pick an informed set of parameters that will adequately test your hypothesis.
Since the new ROH detection algorithm will try to pick the largest runs available in a chromosome, we merely need to provide constraints on the properties of the runs, including a minimum size threshold. The first choice is whether to base that threshold solely on a minimum number of SNPs (as was done in the original Lencz paper) or on the physical distance between the first and last SNP of the run (combined with a minimum number of SNPs). As we discussed earlier, the pervasiveness of LD ensures that we can expect many small ROHs in human populations. A survey of the lengths of stretches of loci in strong LD (also referred to as haplotype blocks) across the human population shows that very few reach over a hundred kb [2]. You could use the haplotype block detection algorithm in SVS to survey block lengths in your own study population. Given this information, the default 500kb minimum is a very conservative threshold, which you may want to adjust down closer to 100kb.
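In code, the two threshold modes amount to something like the following check (an illustrative sketch; the names are mine, not SVS's):

```python
def run_passes_size(positions_bp, min_snps=25, min_length_kb=500):
    """Apply both size criteria to a candidate run: a minimum SNP count
    and a minimum physical span from its first to its last marker.
    positions_bp: sorted base-pair positions of the run's markers."""
    span_kb = (positions_bp[-1] - positions_bp[0]) / 1000.0
    return len(positions_bp) >= min_snps and span_kb >= min_length_kb
```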
Because this method is applied to genotype calls made on microarray platforms, we have to take into account imperfections of the underlying platform. While it's unlikely that genotyping errors would produce spurious long runs of homozygosity, it is possible for a miscall to produce a heterozygous genotype at what is actually a homozygous marker inside a contiguous ROH. Also, given the call-rate statistics of your data, you may expect some uncalled genotypes to show up in your ROHs. If you have already used QA procedures to filter out SNPs with low call rates, you may be comfortable allowing any number of missing genotypes.
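As a back-of-the-envelope guide (my own illustration) for setting the heterozygote allowance: the expected number of erroneous heterozygous calls inside a true ROH is roughly the per-genotype error rate times the number of markers the run spans.

```python
# Rough guide for the allowed-heterozygote parameter (illustrative only):
error_rate = 0.002        # e.g. a 0.2% per-genotype error rate
markers_in_run = 1_000    # SNPs spanned by a candidate run
expected_bad_hets = error_rate * markers_in_run
print(expected_bad_hets)  # -> 2.0, so allowing only 1 het may be too strict
```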
Finally, we need to consider the possibility of low-density marker coverage or pure chance producing unwanted ROHs. The first can be handled by simply disallowing ROHs with a SNP density much lower than that of your genotyping platform; the SNP Density SVS script will report the average SNP density as well as the minimum, maximum, and average gap between markers in your genotyped data. The second is to ensure that no two markers in a run are separated by a gap larger than a reasonable threshold. With these options, we can be reasonably assured that the ROHs detected in the genotype data accurately reflect full-sequence genotypes.
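A minimal sketch of those two checks (parameter names are mine, not SVS's) might look like this: given the base-pair positions of the markers in a candidate run, reject it if its SNP density falls well below the platform average or if any adjacent pair of markers is separated by more than the gap threshold.

```python
def passes_density_filters(positions_bp, min_snps_per_kb, max_gap_bp):
    """positions_bp: sorted base-pair positions of the markers in a
    candidate run. Reject runs that are too sparse overall or that
    contain any single oversized gap between adjacent markers."""
    span_kb = (positions_bp[-1] - positions_bp[0]) / 1000.0
    if span_kb <= 0:
        return False
    if len(positions_bp) / span_kb < min_snps_per_kb:
        return False  # sparser than the platform's typical coverage
    gaps = (b - a for a, b in zip(positions_bp, positions_bp[1:]))
    return all(gap <= max_gap_bp for gap in gaps)

# e.g. require at least 1 SNP per 3 kb and no gap over 100 kb:
print(passes_density_filters([0, 2_000, 5_000, 9_000, 12_000], 1 / 3.0, 100_000))
```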
A note on clusters: the clustering algorithm used to build covariates for association testing simply needs to know the minimum number of samples required to form a cluster of ROHs. The size requirements used to define the ROHs are also applied when searching for ROH clusters. You may want to make the minimum sample count proportional to the size of your study; then again, the Lencz paper found clusters in just 10 or 11 samples that were highly associated with schizophrenia.
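To illustrate the clustering idea (a simplified sketch, not the SVS implementation): count, at each marker, how many samples have an ROH covering it, and call any contiguous stretch where that count meets the minimum a cluster.

```python
def roh_clusters(sample_runs, n_markers, min_samples):
    """sample_runs: per-sample lists of (start, end) marker-index runs,
    inclusive. Returns contiguous marker stretches covered by ROHs in
    at least min_samples samples."""
    coverage = [0] * n_markers
    for runs in sample_runs:
        for start, end in runs:
            for i in range(start, end + 1):
                coverage[i] += 1
    clusters, start = [], None
    for i, c in enumerate(coverage):
        if c >= min_samples and start is None:
            start = i                        # a cluster opens here
        elif c < min_samples and start is not None:
            clusters.append((start, i - 1))  # the cluster closes
            start = None
    if start is not None:
        clusters.append((start, n_markers - 1))
    return clusters
```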
Conclusion
Well, I’ve covered the science behind the runs of homozygosity method, the new detection and filtering techniques we integrated into our algorithm, the uncompromising speed of our implementation, and some guidance to help you decide on the best parameters for your study. I hope you find it useful and accurate in your quest for significant findings! …And that’s my two SNPs.
[1] Keller L., Waller D. Inbreeding effects in wild populations. Trends Ecol. Evol. 2002;17:230–241.
[2] Wall J.D., Pritchard J.K. Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 2003;4:587–597.
[3] Auton A., Bryc K., Boyko A.R., Lohmueller K.E., Novembre J., et al. Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 2009;19:795–803.
Fascinating!
I believe that the SVS algorithm is much better than PLINK's. Still, in the case that we allow 2, 3, or even 4 heterozygotes, does this algorithm distinguish between heterozygotes in a row or not? According to studies, long segments can easily be over 16 Mb or more. In such segments we can have around 7,000 SNPs (that is the case with my 777k bovine SNP chip). If we take into account a low genotyping error rate (0.2%), that means that in an ROH of 20 Mb with 7,000 SNPs we can have 14 SNPs that are heterozygous by mistake. If we allow only one, are we really estimating ROHs properly? And if we allow 14, what if some of them are in a row? One heterozygous SNP surrounded by many homozygous ones can really be a mistake, and the run should not be split. But 2, 3, or more in a row are probably not an error, and the run should be split.
What is your opinion?
Regards
Maja Ferencakovic dipl.ing.
University of Zagreb, Croatia
Hey Maja,
You make a very good point. The algorithm as implemented does not consider the distribution of the allowed heterozygous SNPs within a run. I understand the argument that 14 hets distributed by chance, as you would expect genotyping errors to be, are different than 14 hets clumped together in the middle of a large segment.
So thanks for the suggestion; I’m going to log it as a feature enhancement. At the moment, you would have to filter through ROH calls made with a 14-het allowance to see whether they look like valid runs. This kind of inspection would probably be easiest using our genome browser: zoom to the potential ROH and have the hets colored or encoded differently than the homozygous genotypes to get a sense of their distribution.
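For example, a quick script along these lines (just an illustration, not an SVS feature) would flag runs where the allowed hets are clumped together:

```python
def max_consecutive_hets(genotypes_in_run):
    """Longest stretch of adjacent heterozygous calls (1 = het) in a run.
    A single isolated het looks like a genotyping error; two or more in
    a row probably are not, and the run may deserve closer inspection."""
    best = cur = 0
    for g in genotypes_in_run:
        cur = cur + 1 if g == 1 else 0
        best = max(best, cur)
    return best

# e.g. one isolated het vs. three hets in a row:
print(max_consecutive_hets([0, 2, 1, 0, 2, 2]))  # -> 1, plausible error
print(max_consecutive_hets([0, 1, 1, 1, 2, 0]))  # -> 3, suspicious
```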
Dear Gabe,
thank you for your answer, and thank you for considering my suggestion for program enhancement.
Unfortunately, as a PhD student I can’t afford SVS. I was very impressed with the trial, though.
Maja