One of the tools at the top of the toolbox for researchers working with microarray data is genotype imputation.
Genotype imputation is the process of inferring the genotype of one or more markers based on the correlation pattern (aka linkage disequilibrium or LD) of the surrounding markers for which genotypes are known.
We have now integrated a natively ported version of BEAGLE into Golden Helix’s SVS software. This integration makes imputation and phasing a first-class feature, one that runs as part of your microarray workflow without requiring you to leave the comprehensive SVS platform.
Modern Imputation: The New Old Thing
Although imputation has been around since GWAS reached mainstream adoption as a research paradigm, it has broadened in utility and adapted to a world with large whole-genome reference datasets.
Today, the most common use cases include:
- Increase the density of genotype calls for fine mapping or to identify candidate causal variants at a susceptibility locus
- Harmonize disparate SNP sets between microarray platforms so that they can be analyzed together in meta-analysis or mega-analysis
- Merge public data or NGS variant data with microarray data for combined analysis
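To make the harmonization use case concrete, here is a minimal sketch of how one might work out which markers each platform would need imputed before two datasets can be analyzed together. The marker IDs and the `harmonization_plan` helper are hypothetical and used only for illustration; in a real project the markers to impute must also be present in the reference panel.

```python
# Hypothetical marker IDs genotyped on two different microarray platforms.
array_a = {"rs1", "rs2", "rs3", "rs5"}
array_b = {"rs2", "rs3", "rs4", "rs6"}


def harmonization_plan(markers_a, markers_b):
    """Return the shared markers and the markers each platform is missing."""
    shared = markers_a & markers_b      # typed on both platforms
    combined = markers_a | markers_b    # marker set of the merged analysis
    return {
        "shared": sorted(shared),
        "impute_in_a": sorted(combined - markers_a),
        "impute_in_b": sorted(combined - markers_b),
    }


plan = harmonization_plan(array_a, array_b)
print(plan)
```

The shared markers are the ones an imputation method can use to anchor each dataset to the reference haplotypes; the two "impute" lists are the gaps that imputation would fill.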
In human genetics, the largest public whole genome dataset now contains 2,535 individuals from 26 different populations around the world. For human data, the high-quality Phase 3 genotypes of the 1000 Genomes Project are thus the usual choice of reference panel in a modern imputation workflow.
The full imputation process, in simplified terms, goes like this:
- Phase the reference panel data to create very long or chromosome-length haplotypes
- Phase the observed sample genotype data to create very long or chromosome-length haplotypes
- Use intersecting markers to identify the reference haplotype(s) most similar to each sample haplotype
- Impute the missing alleles on the sample haplotypes with the alleles observed on the corresponding reference haplotype
- Combine the imputed haplotypes for each sample into diploid genotypes for further analysis.
The first step can be skipped by using a pre-phased reference panel if one is available (and there should be one for the human 1000 Genomes data). Otherwise, you would run this step once on your own reference panel samples and save the phased reference panel for use in subsequent projects.
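The last three steps of the process can be illustrated with a toy sketch. It assumes already-phased haplotypes coded as 0/1 alleles and picks the single reference haplotype with the fewest mismatches at the typed markers. This nearest-match rule is a deliberate simplification for illustration: BEAGLE itself uses a probabilistic haplotype model, and all names and data below are hypothetical.

```python
from typing import List, Optional

Haplotype = List[Optional[int]]  # 0/1 alleles; None marks an untyped marker


def closest_reference(sample: Haplotype, reference: List[List[int]]) -> List[int]:
    """Pick the reference haplotype with the fewest mismatches at typed markers."""
    def mismatches(ref_hap: List[int]) -> int:
        return sum(1 for s, r in zip(sample, ref_hap) if s is not None and s != r)
    return min(reference, key=mismatches)


def impute_haplotype(sample: Haplotype, reference: List[List[int]]) -> List[int]:
    """Fill untyped positions with the alleles of the closest reference haplotype."""
    best = closest_reference(sample, reference)
    return [r if s is None else s for s, r in zip(sample, best)]


def to_genotype(hap1: List[int], hap2: List[int]) -> List[int]:
    """Combine two haplotypes into alternate-allele counts (0, 1 or 2)."""
    return [a + b for a, b in zip(hap1, hap2)]


# Toy phased reference panel: three haplotypes over five markers.
reference_panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
]

# One sample's two phased haplotypes, typed only at markers 1, 3 and 5.
sample_hap1 = [0, None, 1, None, 1]
sample_hap2 = [0, None, 0, None, 1]

imputed1 = impute_haplotype(sample_hap1, reference_panel)
imputed2 = impute_haplotype(sample_hap2, reference_panel)
genotypes = to_genotype(imputed1, imputed2)
print(genotypes)
```

Here the first sample haplotype matches the first reference haplotype exactly at the typed markers and the second matches the third, so the untyped markers are copied from those two reference haplotypes before the pair is combined into diploid genotypes.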
Save Time, Save Energy, Use SVS
Our recent release of SNP & Variation Suite (SVS) includes an integrated, natively ported version of BEAGLE v4.1, so you can now do genotype phasing and imputation without leaving SVS.
We have evaluated the existing imputation methods and found that BEAGLE has consistently been updated to improve its speed and accuracy and work with the sizes of datasets used in human and agrigenomic contexts.
Since BEAGLE is an open source package, our work has been in porting its algorithmic core to a natively compiled, multi-threaded, stand-alone, open source package that communicates with SVS efficiently to handle large population datasets in a resource-efficient manner.
Until now, we supported exporting to and importing from BEAGLE using custom scripts that wrote and read BEAGLE's native file format.
The process previously looked like this:
- Export your project marker map information to per-chromosome files
- Export your project SNP data to per-chromosome files using export scripts
- Run lots of BEAGLE program calls on each chromosome, saving the output files
- Import the various output files using three different import scripts
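To give a feel for the bookkeeping involved in step three, here is a rough sketch of what driving BEAGLE one chromosome at a time might look like. It is not the actual export or import scripts: it assumes the VCF-based `gt=`, `ref=`, `map=`, `out=` and `nthreads=` arguments of the BEAGLE 4.1 command line rather than the native file format mentioned above, and the jar path, directory and file names are all hypothetical. The function only builds the command lines and does not execute anything.

```python
def beagle_commands(chromosomes, jar="beagle.jar", workdir="export", threads=4):
    """Build one BEAGLE command line per chromosome (nothing is executed)."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "java", "-jar", jar,
            f"gt={workdir}/chr{chrom}.vcf.gz",          # exported sample genotypes (assumed name)
            f"ref={workdir}/ref.chr{chrom}.vcf.gz",     # phased reference panel (assumed name)
            f"map={workdir}/chr{chrom}.map",            # per-chromosome genetic map (assumed name)
            f"out={workdir}/imputed.chr{chrom}",        # output file prefix
            f"nthreads={threads}",
        ])
    return commands


# Human autosomes 1 through 22: one program call each.
commands = beagle_commands(range(1, 23))
for cmd in commands[:2]:
    print(" ".join(cmd))
```

Each of those calls produces its own set of output files, all of which then have to be tracked and imported back into the project.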
With the integrated BEAGLE imputation, the new workflow is simplified to:
- Open a BEAGLE imputation dialog, select what outputs you want
- Click “Run”
There are no intermediate files slowing down the process: the spreadsheets containing your SNPs are streamed directly into the phasing and imputation algorithms, and the results are written directly into SVS spreadsheets.
Imputation Without the Command Line
If this sounds like it could save you some time and mental energy over context switching and command line management, please contact us at [email protected]!
What size datasets can BEAGLE 4.1 handle, and how long does it take to run in SVS?