Genotype Imputation and Phasing now in SNP & Variation Suite

One of the tools at the top of the toolbox for researchers working with microarray data is genotype imputation.

Genotype imputation is the process of inferring the genotype of one or more markers based on the correlation pattern (aka linkage disequilibrium or LD) of the surrounding markers for which genotypes are known.
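To make the idea of a "correlation pattern" concrete, here is a minimal sketch of the common r² measure of LD between two biallelic markers, computed from phased haplotypes. The function name and the toy input lists are illustrative assumptions and are not part of SVS or BEAGLE.

```python
def r_squared(hap_a, hap_b):
    """Return the LD statistic r^2 between two biallelic markers.

    hap_a and hap_b are equal-length lists of 0/1 alleles, one entry per
    phased haplotype (hypothetical toy input, not an SVS data structure).
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # frequency of allele 1 at marker A
    p_b = sum(hap_b) / n                                  # frequency of allele 1 at marker B
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n   # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0            # monomorphic marker: report 0


# Eight toy haplotypes: marker B mirrors marker A on all but one haplotype.
marker_a = [0, 0, 0, 0, 1, 1, 1, 1]
marker_b = [0, 0, 0, 0, 1, 1, 1, 0]
print(round(r_squared(marker_a, marker_b), 3))
```

For these toy haplotypes r² works out to 0.6. The closer r² is to 1, the more reliably the allele at one marker predicts the allele at the other, which is the signal imputation relies on.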

We have now integrated a natively ported version of BEAGLE into Golden Helix’s SVS software. This integration makes imputation and phasing a first-class feature, one that you can run as part of your microarray workflow without leaving the comprehensive SVS platform.

Modern Imputation: The New Old Thing

Although imputation has been around since GWAS reached mainstream adoption as a research paradigm, it has broadened in utility and adapted to a world with large whole-genome reference datasets.

Today, the most common use cases include:

  • Increasing the density of genotype calls for fine mapping or for identifying candidate causal variants at a susceptibility locus
  • Harmonizing disparate SNP sets between microarray platforms so that they can be analyzed together in a meta-analysis or mega-analysis
  • Merging public data or NGS variant data with microarray data for combined analysis

In human genetics, the largest public whole-genome dataset now contains 2,535 individuals from 26 different populations around the world. For human data, the high-quality Phase 3 genotypes of the 1000 Genomes Project are thus used as the “target” reference panel in a modern imputation workflow.

The full imputation process, in simplified terms, goes like this:

  1. Phase the reference panel data to create very long or chromosome-length haplotypes
  2. Phase the observed sample genotype data to create very long or chromosome-length haplotypes
  3. Use intersecting markers to identify the reference haplotype(s) most similar to each sample haplotype
  4. Impute the missing alleles on the sample haplotypes with the alleles observed on the corresponding reference haplotype
  5. Combine the imputed haplotypes for each sample into diploid genotypes for further analysis

The first step can be skipped if a pre-phased reference panel is available (as there should be for the human 1000 Genomes data). Otherwise, you run this step once on your own reference panel samples and save the phased reference panel for use in subsequent projects.
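Steps 3 through 5 of the list above can be illustrated with a deliberately simplified sketch. It assumes phasing is already done and copies alleles from the single closest reference haplotype. BEAGLE itself uses a probabilistic haplotype model rather than this nearest-match rule, so treat the function names and toy data below as hypothetical teaching aids, not as the BEAGLE algorithm.

```python
MISSING = None  # marker not genotyped on the sample's array


def impute_haplotype(sample_hap, reference_haps):
    """Fill missing alleles in one phased sample haplotype (steps 3 and 4).

    Picks the single reference haplotype with the fewest mismatches at the
    markers the sample was typed on, then copies its alleles into the gaps.
    """
    typed = [i for i, allele in enumerate(sample_hap) if allele is not MISSING]

    def mismatches(ref_hap):
        return sum(sample_hap[i] != ref_hap[i] for i in typed)

    best = min(reference_haps, key=mismatches)
    return [best[i] if allele is MISSING else allele
            for i, allele in enumerate(sample_hap)]


def to_genotypes(hap1, hap2):
    """Combine two haplotypes into per-marker alternate-allele counts (step 5)."""
    return [a + b for a, b in zip(hap1, hap2)]


# Toy pre-phased reference panel: 3 haplotypes over 6 markers (0 = ref, 1 = alt).
reference = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
]
# One sample, already phased, typed at markers 0, 2 and 5 only.
sample_hap1 = [0, MISSING, 1, MISSING, MISSING, 1]
sample_hap2 = [1, MISSING, 0, MISSING, MISSING, 0]

imputed1 = impute_haplotype(sample_hap1, reference)
imputed2 = impute_haplotype(sample_hap2, reference)
print(to_genotypes(imputed1, imputed2))
```

Here the first sample haplotype matches the first reference haplotype at all three typed markers and the second matches the second, so the gaps are filled from those two and the result is one diploid genotype per marker.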

Save Time, Save Energy, Use SVS

With our recent release of SNP & Variation Suite, which integrates a natively ported version of BEAGLE v4.1, you can now run genotype phasing and imputation without leaving SVS.

We evaluated the existing imputation methods and found that BEAGLE has been consistently updated to improve its speed and accuracy and to handle the dataset sizes used in human and agrigenomic contexts.

Since BEAGLE is an open source package, our work has gone into porting its algorithmic core to a natively compiled, multi-threaded, stand-alone, open source package that communicates efficiently with SVS and handles large population datasets in a resource-efficient manner.

Previously, we supported exporting to and importing from BEAGLE using custom scripts that wrote and read BEAGLE’s native file format.

The process previously looked like this:

  • Export your project marker map information to per-chromosome files
  • Export your project SNP data to per-chromosome files using export scripts
  • Run lots of BEAGLE program calls on each chromosome, saving the output files
  • Import the various output files using three different import scripts
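To give a feel for the bookkeeping a per-chromosome command-line run involves, here is a hedged sketch that only builds the command lines and does not execute them. It assumes the VCF-based `gt=`, `ref=`, and `out=` arguments of BEAGLE 4.1 rather than the older native file format mentioned above, and every file name and the jar name are hypothetical placeholders.

```python
def beagle_commands(jar="beagle.jar", chromosomes=range(1, 23)):
    """Build one BEAGLE 4.1 command line per autosome (nothing is executed)."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "java", "-jar", jar,                 # jar name is a placeholder
            f"gt=study.chr{chrom}.vcf.gz",       # hypothetical exported study genotypes
            f"ref=reference.chr{chrom}.vcf.gz",  # hypothetical phased reference panel
            f"out=imputed.chr{chrom}",           # output prefix for this chromosome
        ])
    return commands


for cmd in beagle_commands():
    print(" ".join(cmd))
```

Even in this stripped-down form there are 22 separate invocations to launch and monitor, plus 22 sets of output files to import afterwards.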

With the integrated BEAGLE imputation, the new workflow is simplified to:

  • Open the BEAGLE imputation dialog and select the outputs you want
  • Click “Run”

No intermediate files slow down the process: the spreadsheets containing your SNPs are streamed directly into the phasing and imputation algorithms, and the results are written directly into SVS spreadsheets.

Imputation Without the Command Line

If this sounds like it could save you some time and mental energy otherwise spent on context switching and command line management, please contact us at [email protected]!

Gabe Rudy

