Genotype imputation is a statistical technique for estimating sample genotypes at loci that were not directly assayed by sequencing or microarray experiments. There are several reasons why you might want to use imputation in a research study. For example:
- Improve call rates in GWAS by imputing sporadic missing genotypes
- Harmonize the data content from different GWAS genotyping platforms so that they can be analyzed together in meta-analysis or mega-analysis
- Increase density of genotype calls for fine mapping or to identify candidate causal variants at a susceptibility locus.
These are all important applications for imputation technology and can make significant contributions to a successful study. There is also a fourth application often cited for imputation: Whole-genome imputation and analysis may boost study power and/or produce a novel association that wasn’t found in GWAS analysis of the genotyped SNPs.
The idea that imputation might help you find something that you initially missed is very alluring. We want to use all available resources and methods to extract any information that we can from the data. Despite the allure, it is good to remember that there are real limits to imputation technology. And what if I told you that a relatively simple statistical routine might save you the effort of whole-genome imputation and directly identify the loci that might harbor an “unseen” association in your data?
Imperfect Imputation Input Implications
Imputation results are only as good as the inputs provided. Both primary inputs – the reference panel data and the original genotypes of the study population – are important here. The most popular reference panel in recent years has been the 1000 Genomes Project Phase 1 data. The phenotypes of these samples are not known, but they are presumably healthy people for the most part. The 1000 Genomes sample collection is certainly not enriched for any particular diseases or disease-causing variants. If the disease variant doesn’t occur in the reference panel or doesn’t appear with sufficient frequency for accurate phasing, it won’t be found with imputation.
Some diseases are associated with relatively common variants. The ApoE locus in Alzheimer’s disease is a good example. Such variants may be found with imputation, but only if the variant is sufficiently correlated with observed genotypes in the baseline GWAS data. The ApoE locus can’t be imputed from the Affymetrix 500K array, for example.
There is a certain irony to imputation. You can’t impute an unobserved variant unless you’ve already genotyped a correlated marker. But if you’ve already genotyped a correlated marker, then you can likely find the association signal without imputing. This is the fundamental nature of GWAS. GWAS methodology is built on the concept of tag-SNPs; if the unseen causal variant is tagged by a correlated SNP on the array, you will hopefully find the association. Imputation can then be used to assist the process of finding the causal variant (the third application listed above). If the causal variant is not tagged well by the GWAS SNPs, then you are unlikely to observe the association signal, and also perhaps unlikely to successfully impute the causal variant. This is why the input genotypes are so important to the success of imputation.
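To make the tagging idea concrete, here is a minimal Python sketch that computes r², the squared correlation between two markers, from genotypes coded as 0, 1, or 2 copies of the minor allele. The genotype vectors and the names `r_squared`, `causal`, `good_tag`, and `weak_tag` are invented for illustration; they are not drawn from any real dataset or from any particular software package.

```python
from statistics import mean

def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov * cov / (var1 * var2)

# Hypothetical genotypes (minor-allele counts) for eight samples.
causal   = [0, 1, 2, 0, 1, 0, 2, 1]  # unobserved causal variant
good_tag = [0, 1, 2, 0, 1, 0, 2, 1]  # array SNP that tracks the causal variant
weak_tag = [1, 0, 1, 2, 0, 1, 1, 0]  # array SNP that barely tracks it

print("r^2 with good tag:", r_squared(causal, good_tag))
print("r^2 with weak tag:", r_squared(causal, weak_tag))
```

A causal variant with a high r² to an array SNP will usually show up in the GWAS through that SNP. One with only a low r² to every array SNP is both hard to detect and hard to impute.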
Imputation and Haplotypes
Astute readers will point out that imputation is more complex than a simple 1:1 linkage disequilibrium (LD) relationship between an observed SNP and the unobserved variant; imputation is really about haplotypes. The imputation process, in simplified terms, goes like this:
- Phase the reference panel data to create very long or chromosome-length haplotypes
- Phase the observed sample genotype data to create very long or chromosome-length haplotypes
- Use intersecting markers to identify the reference haplotype(s) most similar to each sample haplotype
- Impute the missing alleles on the sample haplotypes with the alleles observed on the corresponding reference haplotype
- Combine the imputed haplotypes for each sample into diploid genotypes for further analysis.
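The last three steps can be sketched in a few lines of Python. This is a deliberately naive toy, not how any imputation package works: it assumes phasing (the first two steps) is already done, picks a single closest reference haplotype by counting mismatches at the typed markers, and copies its alleles into the gaps. Production imputation tools use probabilistic models rather than a single best match. The function names and the tiny reference panel are hypothetical.

```python
MISSING = None  # marker not typed in the study sample

def closest_reference(sample_hap, reference_haps):
    """Return the reference haplotype with the fewest mismatches at typed markers."""
    def mismatches(ref):
        return sum(1 for s, r in zip(sample_hap, ref)
                   if s is not MISSING and s != r)
    return min(reference_haps, key=mismatches)

def impute_haplotype(sample_hap, reference_haps):
    """Fill untyped positions with alleles from the closest reference haplotype."""
    best = closest_reference(sample_hap, reference_haps)
    return [r if s is MISSING else s for s, r in zip(sample_hap, best)]

def impute_sample(hap_a, hap_b, reference_haps):
    """Impute both haplotypes, then sum alleles into diploid genotypes (0/1/2)."""
    a = impute_haplotype(hap_a, reference_haps)
    b = impute_haplotype(hap_b, reference_haps)
    return [x + y for x, y in zip(a, b)]

# Hypothetical phased reference panel: five markers, alleles coded 0/1.
reference = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
]

# One study sample, phased, typed only at markers 0, 2, and 4.
hap_a = [0, MISSING, 1, MISSING, 1]
hap_b = [1, MISSING, 1, MISSING, 1]

genotypes = impute_sample(hap_a, hap_b, reference)
print(genotypes)
```

Note that the alleles filled in at markers 1 and 3 are determined entirely by which reference haplotype matched at markers 0, 2, and 4. The imputed genotypes carry no information that was not already present in the typed markers plus the reference panel.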
Imputation provides us with a set of high-density genotypes (or probability-adjusted allelic dosage values) that can be analyzed using GWAS-style test procedures. When you step back and look at the process, you might say that imputation is little more than a computationally intensive procedure for selecting tag-SNPs. Think about it. We use probabilistic methods to construct haplotypes from low-density data, then compare those haplotypes with high-density data to identify individual SNPs that can be used to represent the haplotypes in association tests.
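For readers unfamiliar with dosage values, the short Python sketch below shows the usual idea: the dosage is the expected number of copies of the alternate allele given the three imputed genotype probabilities. The function names and the example probabilities are assumptions made for illustration, and individual tools may differ in which allele they count.

```python
def allelic_dosage(p_ref_ref, p_ref_alt, p_alt_alt):
    """Expected alternate-allele count from imputed genotype probabilities."""
    total = p_ref_ref + p_ref_alt + p_alt_alt
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p_ref_alt + 2.0 * p_alt_alt

def best_guess_genotype(p_ref_ref, p_ref_alt, p_alt_alt):
    """Hard call: the genotype (0, 1, or 2 alt alleles) with the highest probability."""
    probs = (p_ref_ref, p_ref_alt, p_alt_alt)
    return max(range(3), key=lambda g: probs[g])

# Hypothetical imputation output for one sample at one variant.
probs = (0.1, 0.7, 0.2)
print("dosage:", allelic_dosage(*probs))
print("best-guess genotype:", best_guess_genotype(*probs))
```

Testing the dosage rather than the hard call lets the association test reflect how uncertain the imputed genotype is.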
In order to discover a new association that wasn’t evident in the original GWAS, an associated/causal SNP would only need to be represented on one imputed haplotype. The causal SNP would need to have moderate LD with multiple SNPs genotyped in the GWAS, but no correlation strong enough for any single observed SNP to be significant in the GWAS. It is certainly possible for this to happen, and simulations of this scenario can be constructed. But you don’t necessarily need to do imputation to find this association; you can probably find it with relatively simple haplotype tests.
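Here is a toy illustration of that scenario in Python. It assumes phased haplotypes over two genotyped SNPs, A and B, with a causal variant that rides only on the "11" haplotype. All counts are made up, chromosomes are treated as independent observations, and the test is a plain Pearson chi-square on a 2x2 table. It is a sketch of the general idea, not the haplotype test implemented in any particular package.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical haplotype counts (1000 chromosomes per group).
# Each key is the allele at SNP A followed by the allele at SNP B.
cases    = {"00": 373, "01": 233, "10": 234, "11": 160}
controls = {"00": 400, "01": 250, "10": 250, "11": 100}

def carrier_test(is_carrier):
    """Chi-square comparing carrier vs. non-carrier chromosomes in cases and controls."""
    case_yes = sum(n for h, n in cases.items() if is_carrier(h))
    ctrl_yes = sum(n for h, n in controls.items() if is_carrier(h))
    case_no = sum(cases.values()) - case_yes
    ctrl_no = sum(controls.values()) - ctrl_yes
    return chi_square_2x2(case_yes, case_no, ctrl_yes, ctrl_no)

chi2_snp_a = carrier_test(lambda h: h[0] == "1")   # single-SNP test at A
chi2_snp_b = carrier_test(lambda h: h[1] == "1")   # single-SNP test at B
chi2_hap_11 = carrier_test(lambda h: h == "11")    # haplotype 11 vs. all others

print(f"SNP A: {chi2_snp_a:.2f}  SNP B: {chi2_snp_b:.2f}  haplotype 11: {chi2_hap_11:.2f}")
```

With these invented counts, each single-SNP statistic is diluted by the chromosomes that carry the "1" allele on a non-risk haplotype, so the haplotype statistic comes out several times larger. No imputation was needed to see the difference.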
The Golden Helix SNP and Variation Suite (SVS) software has extensive functionality to look beyond single-SNP associations in GWAS. These options include haplotype analysis, runs of homozygosity (ROH) analysis, and multiple regression analysis that models the combined contributions of neighboring SNPs. All of these methods have the potential to uncover hidden associations that might not be apparent from testing individual SNPs in a standard GWAS approach.
Our previously recorded webcast, “Getting More from GWAS,” reviews some fundamentals of GWAS and the analytic features available in SVS, with particular emphasis on haplotypes, ROH, and other multi-marker analysis methods.