Genotype imputation is a common and useful practice that allows GWAS researchers to analyze untyped SNPs without the cost of genotyping millions of additional SNPs. In the Services Department at Golden Helix, we often perform imputation on client data, and we have our own software preferences for a variety of reasons. However, other imputation software packages have their own advantages as well. This motivated us to perform some tests to assess certain performance features, such as accuracy and computation time, of a few common imputation software programs.
Study Design
For this comparison, we tested three imputation programs: BEAGLE, IMPUTE2, and Minimac. With BEAGLE and IMPUTE2, imputation was performed both with and without pre-phasing the sample data. Minimac is an implementation of the MaCH method that relies on pre-phasing; we did not run MaCH without pre-phasing due to computational constraints. Pre-phasing phases the sample data before imputation rather than during it, which can significantly reduce computation time at the cost of a slight loss in accuracy.
The variables measured include imputation accuracy (concordance rates), imputation quality, computation time, and memory usage. Concordance for each SNP is the number of correctly imputed genotypes (judged by comparing the imputed data against the full dataset) divided by the total number of genotypes, i.e. the number of samples. Quality was determined from the per-SNP quality metrics provided by each program. These metrics differ between programs, so the recommended threshold for each metric was applied separately. Computation time was measured by running each program on a 64-bit Linux computer with 16 GB of memory.
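The concordance calculation described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the study: the function names, the 0/1/2 genotype coding, and the choice to skip missing genotypes (None) are all assumptions made for the example.

```python
from typing import Dict, Optional, Sequence

Genotype = Optional[int]  # assumed coding: 0/1/2 copies of an allele, None if missing


def snp_concordance(imputed: Sequence[Genotype], truth: Sequence[Genotype]) -> float:
    """Fraction of samples whose imputed genotype equals the genotyped one for one SNP."""
    if len(imputed) != len(truth):
        raise ValueError("genotype vectors must have the same number of samples")
    # Assumption: samples with a missing genotype on either side are not counted.
    pairs = [(i, t) for i, t in zip(imputed, truth) if i is not None and t is not None]
    if not pairs:
        raise ValueError("no comparable genotypes for this SNP")
    matches = sum(1 for i, t in pairs if i == t)
    return matches / len(pairs)


def mean_concordance(
    imputed_by_snp: Dict[str, Sequence[Genotype]],
    truth_by_snp: Dict[str, Sequence[Genotype]],
) -> float:
    """Average the per-SNP concordance over all SNPs present in both datasets."""
    shared = [snp for snp in imputed_by_snp if snp in truth_by_snp]
    if not shared:
        raise ValueError("no SNPs in common")
    rates = [snp_concordance(imputed_by_snp[s], truth_by_snp[s]) for s in shared]
    return sum(rates) / len(rates)
```

For instance, a SNP with four samples of which three are imputed correctly has a concordance of 0.75; the mean concordance is the plain average of these per-SNP values.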
The baseline study data included 141 unrelated HapMap samples genotyped on Illumina Omni1, representing the three major HapMap population groups. We imputed these samples based on the 1000 Genomes Phase 1 v3 reference panel as provided on each imputation program’s website. In order to simulate how a researcher would typically perform imputation on their own data, the reference datasets were downloaded directly from each program’s website and were not modified. Each data provider filters the reference data in a slightly different way, so this means that the reference datasets were not identical, even though all were derived from the same original dataset. The sample data was limited to only include SNPs in chromosome 20.
The following Venn diagram represents the overlap, by genomic position, between the three reference datasets and the original 1000 Genomes dataset. Because the diagram counts positions rather than variants, the total number of rows in each dataset is slightly higher than the number displayed on the diagram: some variants share a position.
An interesting point to note about this diagram is the existence of markers in the IMPUTE2 and BEAGLE reference datasets at genomic positions that were not found in the original 1000 Genomes dataset. Upon further investigation, most of these could be attributed to off-by-one position differences for some indels reported in the 1000 Genomes dataset. This demonstrates how different data processing pipelines handle complex genotype information in slightly different ways. For the same reason, the total number of markers at unique positions differs in each version of the reference dataset.
Results
Each program outperformed the others in certain areas. Across all of the metrics measured, IMPUTE2 appeared to deliver the greatest accuracy and quality, although other programs performed better in other areas.
As expected, pre-phasing the original dataset drastically improved the total compute time. When the data was pre-phased, IMPUTE2 ran the quickest, followed by Minimac, and then BEAGLE. Without pre-phasing, IMPUTE2 was much faster than BEAGLE.
IMPUTE2 also had superior concordance rates, although all software programs performed well in this area. Minimac had the lowest concordance rate at 96.25%.
Software | Total Compute Time* | Mean SNP Concordance | Total # SNPs | # High Quality SNPs | % High Quality Imputed |
IMPUTE2 | 23 hours | 99.98% | 668,180 | 620,792 | 92.9% |
BEAGLE | 213 hours | 98.43% | 484,023 | 320,991 | 66.3% |
IMPUTE2 with Pre-phasing | 8 hours | 99.92% | 668,180 | 297,196 | 44.5% |
BEAGLE with Pre-phasing | 34 hours | 98.05% | 484,023 | 293,890 | 60.7% |
Minimac | 18 hours | 96.25% | 667,870 | 450,790 | 67.5% |
*includes all steps required
Without pre-phasing, IMPUTE2 had the highest quality imputation, but after pre-phasing, the certainty metric provided in the IMPUTE2 output dropped dramatically (see first figure below). The R^2 accuracy value given by BEAGLE was also lower in the output based on pre-phased data, but the change was not nearly as dramatic (see second figure below).
An unfortunate side effect of IMPUTE2 was its intensive memory usage. IMPUTE2 used all available RAM (16 GB), making it impossible to perform any other tasks on the machine. BEAGLE and Minimac, on the other hand, used far less memory (although they took longer to finish). BEAGLE was run using the “lowmem” option for more efficient memory usage, which also had the effect of increasing runtime.
All of the 141 test samples are also included in the 1000 Genomes reference panel. We recognize that this may bias the accuracy of the results, but it was acceptable for our purposes. The concordance rates represent how well each imputation program was able to reproduce genotypes for samples where the correct answer was already present in the reference panel. The algorithms used in each program may be more or less appropriate for this situation.
Another metric not discussed previously is the availability of documentation. In this category, BEAGLE wins. Not only does it have a nice PDF manual, but we have also had great success asking the authors specific questions and getting thorough responses in a timely manner.
In summary, choosing the most appropriate imputation program to use depends on the qualities most important to the researcher and the hardware available. An important factor in our testing was that we chose to run the entire length of chromosome 20 in a single batch. The performance of the various tools, particularly with regard to compute time, would likely be quite different had we run the imputation in smaller batches.
How large was the reference population and were all imputed samples the same race or a mix?
Hi Matthew, thanks for the question. The reference population included all of the 1092 samples and was thus of mixed race. The sample data also contained samples from each population represented in 1kG.
Hi Autumn. Excellent work, and of great interest since I have used both IMPUTE2 and minimac for different projects. Is it possible to expand on a few points?
1) For each imputation, what was the total number of input genotyped SNPs, and was there a minimum minor allele frequency?
2) What threshold was used for high quality imputation?
3) What were the minor allele frequency characteristics of the ~187k SNPs that are in 1000G but not imputed by any of the programs? Am I right in thinking most of them are rare relative to the input genotypes?
Hi Matthew, thanks for your questions!
The same baseline Illumina dataset was used as input into each imputation program and this dataset contained approximately 23K SNPs in chromosome 20. The SNPs were filtered by call rate (> 95%) but not minor allele frequency.
A high-quality threshold value of 0.5 was used for Beagle and Minimac (R^2) while a threshold value of 1 was used for Impute2 (certainty).
You’re spot on in regards to the ~187K SNPs in 1kG. All of those SNPs had at most 1 copy of the minor allele. Both Impute2 and Mach remove monomorphic SNPs and singletons from their reference panels while Beagle used a more conservative filter (< 5 copies of minor allele) to create its reference panel.
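As a rough illustration of how the thresholds mentioned in this reply could be applied, the sketch below flags a SNP as high quality when its metric meets the program's threshold (0.5 for the Beagle and Minimac R^2, 1 for the Impute2 certainty). The function names are hypothetical, and treating the threshold as inclusive (>=) is an assumption; the reply does not say whether the comparison was strict.

```python
from typing import Iterable

# Thresholds quoted in the reply above; keys are lower-case program names.
HIGH_QUALITY_THRESHOLDS = {
    "beagle": 0.5,   # R^2
    "minimac": 0.5,  # R^2
    "impute2": 1.0,  # certainty
}


def is_high_quality(program: str, metric: float) -> bool:
    """True if the per-SNP metric meets the program's threshold (assumed inclusive)."""
    return metric >= HIGH_QUALITY_THRESHOLDS[program.lower()]


def percent_high_quality(program: str, metrics: Iterable[float]) -> float:
    """Percentage of SNPs whose quality metric passes the program's threshold."""
    values = list(metrics)
    if not values:
        raise ValueError("no SNP metrics supplied")
    kept = sum(1 for m in values if is_high_quality(program, m))
    return 100.0 * kept / len(values)
```

The "% High Quality Imputed" column in the results table is this kind of ratio: the number of SNPs passing the threshold divided by the total number of imputed SNPs.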
Interesting work!
Based on Dr. Marchini’s review paper (http://www.nature.com/nrg/journal/v11/n7/abs/nrg2796.html), IMPUTE 2 certainty metric and MACH r^2 actually are highly correlated. So I don’t understand why different thresholds were used for IMPUTE2 and Minimac in your study.
Could you please tell us a little bit more about how you estimated the mean concordance rate? Were all imputed SNPs included, or only the high-quality ones?
Thanks for your questions. It is true that those metrics are correlated (Beagle R^2 seems to be as well), but the values have very different ranges. The R^2 values typically range from 0 to 1, while the certainty metric was observed between approximately 0.7 and 1.
The per-SNP concordance rate is essentially the percentage agreement between the imputed and genotyped data over all samples in the Illumina dataset for each SNP. The entire imputed dataset was used to average these values to find the mean concordance over all SNPs.
Thanks for the clarification.
But “…the certainty metric was observed between approximately 0.7 and 1.” sounds a little bit odd to me. Please check the following link.
http://www.nature.com/nrg/journal/v11/n7/extref/nrg2796-s5.pdf
Interesting work. What are the parameters you used in the runs for each software? eg. in minimac, if you increase the number of rounds, the results could be much improved.
Thanks for your question. For the imputation parameters, I used the recommended parameters or the parameters used in the example documentation for each program. For MaCH phasing and minimac imputation (http://genome.sph.umich.edu/wiki/Minimac), that means “--rounds 20 --states 200” and “--rounds 5 --states 200”, respectively.
Thanks for the suggestion too. I’m running some more tests and I’ll play with this parameter to see if I can improve my results.
When you ran IMPUTE2 prephasing, did you use IMPUTE2’s own prephasing, the shapeit program, or the shapeit2 program? We have been running shapeit as the pre-phasing and did not observe this drop in quality.
Thanks for your question. I used Impute2 to perform prephasing following the example here: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#ex2. However, the Impute2 folks do state elsewhere on their website that shapeit2 will provide higher accuracy.
Hi,
Very nice piece of work! Just a comment: the difference in imputation quality you observe between the two scenarios using Impute2 is likely due to, I quote: “An important factor in our testing was that we chose to run the entire length of chromosome 20 in a single batch”. Performance of the Impute2 phasing machinery decreases dramatically as the length of the studied region increases. On the website of the authors, it is advised not to go beyond ~5Mb chunks. Impute2 chooses the best conditioning haplotypes locally using Hamming distance: this strategy performs really well when the region is smaller than 5Mb, but very poorly at the whole chromosome scale. Two solutions to avoid this problem: (1) run prephasing with Impute2 in chunks smaller than 5Mb, or (2) run shapeit2 on the whole chromosome (its performance is independent of the length of the region studied). Even your Impute2 results obtained by integrating over uncertainty will be better.
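The chunking suggested in this comment can be sketched as a small helper that splits a chromosome into consecutive intervals no longer than 5 Mb. The function name and the inclusive, 1-based coordinates are assumptions made for illustration; how the intervals are then passed to the imputation program depends on that program's own options.

```python
from typing import List, Tuple


def chunk_intervals(start: int, end: int, max_len: int = 5_000_000) -> List[Tuple[int, int]]:
    """Split the inclusive range [start, end] into consecutive chunks of at most max_len bp."""
    if start > end:
        raise ValueError("start must not exceed end")
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    intervals = []
    lo = start
    while lo <= end:
        hi = min(lo + max_len - 1, end)  # last chunk may be shorter
        intervals.append((lo, hi))
        lo = hi + 1
    return intervals
```

For a chromosome of roughly 63 Mb (an approximate figure used here only as an example), `chunk_intervals(1, 63_000_000)` yields 13 chunks, each of which could be pre-phased or imputed as a separate job.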
There’s a problem with this analysis: the HapMap samples are all part of 1000 Genomes, so you’re trying to impute samples that have a perfect match in the reference panel. I think this explains why the results for IMPUTE2 are impossibly good — 93% of all SNPs imputed with a quality of 1. The results for Beagle and Minimac are closer to what I would expect; I guess these algorithms are less able to exploit very long-range matches between the test data and the reference panel. But all the results will be biased due to not using an out-of-sample test set.
Very nice work. Would it be possible to know which 141 samples you used? I would like to recreate your comparison and this would allow me to make sure I am doing it properly!
Hi Autumn. Would you please help me? I am studying “Imputation of 90k SNP genotypes using different reference population size”. I have just one question. I don’t know how to check the quality of imputation, I mean the accuracy of imputation. For example, after imputation with Beagle, I have the Beagle imputation output files, such as file.gprobs, file.dose, file.r2. How could I get the accuracy of imputation, per individual and per allele or genotype?
Could I do this work with the snpStats package in the R environment? Or PLINK? Or which program do I need for this work?
Thank You.
Hi Mohammad,
I am happy to help with your imputation questions.
Beagle reports imputation accuracy in the allelic R^2 output file (file.r2). This file contains two columns: the first gives the marker identifiers, and the second gives the estimated squared correlation (0 <= R^2 <= 1) between the allele dosage with highest posterior probability in the genotype probabilities file (file.gprobs) and the true allele dosage for the marker. Larger values of allelic R^2 indicate more accurate genotype imputation. More information about the output from Beagle is available at the following link: https://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf
Let us know if you have any further questions.
Thanks,
Jami…
Senior Field Application Scientist
Golden Helix, Inc.
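The two-column allelic R^2 file described in the reply above can be read with a short script. The following is a minimal sketch, assuming the file is whitespace-delimited with the marker identifier first and the R^2 value second; the function names and the default cutoff of 0.5 are choices made for this example rather than part of Beagle.

```python
import math
from typing import Dict, Iterable, List


def read_allelic_r2(lines: Iterable[str]) -> Dict[str, float]:
    """Parse 'marker R2' lines into a dict, skipping blank or unparseable lines."""
    r2_by_marker: Dict[str, float] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            r2_by_marker[fields[0]] = float(fields[1])
        except ValueError:
            continue  # e.g. a header line, if one is present
    return r2_by_marker


def markers_passing(r2_by_marker: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Markers whose allelic R^2 is a number at or above the threshold."""
    return [
        marker
        for marker, r2 in r2_by_marker.items()
        if not math.isnan(r2) and r2 >= threshold
    ]
```

Any iterable of lines works as input, so an open file handle can be passed directly, for example `read_allelic_r2(open("file.r2"))`. Note that this gives the estimated per-marker accuracy only; measuring true per-individual or per-genotype accuracy requires comparing imputed genotypes against known genotypes that were masked before imputation.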
Thanks for sharing this interesting article. I have received imputed data from IMPUTE2 and I would like to know if there is a suitable methodology to perform post-imputation QC and association analysis.
I have experience working with a single file containing the data for all chromosomes, but now I have the imputed data in separate files for Ch1 ~ 22.
Best
M
Hi Mahantesh, we did a webcast on this topic back in 2016 that you can access here https://www.goldenhelix.com/resources/webcasts/BEAGLE-Imputation-in-SVS/index.html.
This webcast covers the imputation approach and examining the genotype and QC output for further downstream analysis. In SVS, if you want to examine the data collectively, you can merge it into one file. Once merged, you can then perform the QC steps outlined in the webcast.
If you need further assistance with this please email [email protected], our support team would be happy to help!