If you have ever worked with NGS variant data, you may have come to realize that the first task at hand is the seemingly simple categorization of your variants into two bins: known and novel.
Of course, you may also have come to the realization that this step is more complex than it seems. It requires the important choice of how you define a variant as “known” or “common,” and that choice has implications for your analysis. You are not short on sources of known variants, from the venerable dbSNP archive to the clean-slate 1000 Genomes project. It was even mentioned at ASHG/ICHG this year that Complete Genomics intends to add a “globally diverse” panel of reference genomes to their current repository of 69 public genomes.
Often the best choice for your novel variant categorization is to use a reference panel of control individuals with the closest matching ethnicity. But rather than have to face the task of compiling all those reference panels yourself for analysis, it may be preferable to curate them from a known source. In this blog post we will demonstrate the process for curating this data using the 1000 Genomes project’s data on 1,094 individuals of diverse origin as an example. The results of this process are annotation tracks of allele frequencies that can be used in the SNP & Variation Suite (SVS) genome browser and filtering workflows (like those in our NGS Variant Analysis tutorial).
About the 1000 Genomes Project to Date
The goal of the 1000 Genomes project is to “find most genetic variants that have frequencies of at least 1% in the population studied.” The project was broken down into three pilot projects and the main project. The pilot projects served to assess and help define the project specifications: data was analyzed by multiple centers on various platforms, and the methods for gene region capturing were also assessed. The pilot projects are now complete, and they determined that 4x coverage was sufficient to meet the goal of identifying most of the variants with a frequency greater than or equal to 1%. The effort has now moved on to the main project, which is broken up into three phases: Phase 1 will sequence 1,167 samples from 13 populations, Phase 2 will sequence 633 samples from 7 populations, and Phase 3 will sequence 779 samples from about 8 populations. Unlike the HapMap project, no assumptions regarding “health” are made with these samples.
To date, a subset of 1,094 samples from Phase 1 has had whole genome sequencing performed and the data made available. We will use this dataset to create variant frequency annotations for the whole population set and sub-populations.
Let’s Start Curating
For the 1,094-sample public data, there are two types of files: an “All Sites” file and per-chromosome genotype call files.
The “All Sites” file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) gives counts of all alleles, counts of alternate alleles, and alternate allele frequencies (AAF). Because the per-sample “alt allele” counts vary greatly, the AAF is computed from a joint analysis of the alignments of all samples. This file was used directly to create the 1kG_Phase1_All-Sites-2011_05-GHI_GRCh_37_Homo_sapiens annotation track available to download in SVS; no extra work was required to create this track other than selecting the proper fields from the VCF file to include. Although the VCF header claimed that sites with more than one alternate allele would list a frequency for each ALT allele, in the same order as the ALT field, in practice there was only ever one alternate allele frequency, so the same frequency was used for each “Ref/Alt” pair.
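As an illustration of the field selection described above, here is a minimal sketch of pulling the site positions, alleles, and the single AF value out of the VCF. The function name and the reuse of one AF across multiple ALT alleles follow the behavior observed in the file; this is not the actual SVS import code.

```python
# Sketch: extract site-level alternate allele frequencies from a
# sites-only VCF such as the 1000 Genomes "All Sites" file.
import gzip

def iter_site_frequencies(vcf_path):
    """Yield (chrom, pos, ref, alt, aaf) for each Ref/Alt pair in a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            # Only one AF value was ever present in practice, so the same
            # frequency is reused for every alternate allele at the site.
            aaf = float(info.get("AF", "0").split(",")[0])
            for alt_allele in alt.split(","):
                yield chrom, pos, ref, alt_allele, aaf
```

A multi-allelic site such as `A → G,T` with `AF=0.25` therefore yields two records, one per Ref/Alt pair, both carrying 0.25.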
The second type of file contains genotype calls, which we enhanced using SNPTools (an imputation program) to make the calls more accurate. Thus there are no missing values in the calls for each variant. As a result, the genotype calls are not directly equivalent to the allele count results in the “All Sites” file. The benefit of the genotype calls, however, is that the samples can be subdivided by population to get population-specific variant genotype calls as well as population-specific alternate allele frequencies.
So how should the populations be defined in order to build these alternate allele frequencies? Well, there are a couple of options. The populations of the samples given by the 1000 Genomes project are from several different continents; in the first phase of the project, the continents include the Americas, Europe, East Asia, and West Africa.
| Population | Number of Samples |
| --- | --- |
| (ASW) HapMap African ancestry individuals from SW US | |
| (CHB) Han Chinese in Beijing | |
| (CHS) Han Chinese South | |
| (CLM) Colombian in Medellin, Colombia | |
| (FIN) HapMap Finnish individuals from Finland | |
| (GBR) British individuals from England and Scotland | |
| (IBS) Iberian populations in Spain | |
| (JPT) Japanese individuals | |
| (LWK) Luhya individuals | |
| (MXL) HapMap Mexican individuals from LA California | |
| (PUR) Puerto Rican in Puerto Rico | |
| (YRI) Yoruba individuals | |
It seems that the “Continent” categorization results in groups large enough to provide meaningful allele frequencies. But to validate this approach, let’s use principal component analysis (PCA) of the genotypes for all samples. We should find in the first two Principal Components (PCs) a strong segregation by population and hopefully by continent.
This is easier said than done. Just take a look at the size of those genotype VCF files! We went ahead and imported those massive VCF files all at once into a Golden Helix SVS project. The import took several days on a Windows 2008 server with 8 cores and 64 GB of RAM but went smoothly and resulted in a spreadsheet with 38,877,749 genotype columns and 1,094 sample rows.
Before we ran PCA on the genotype data, a basic set of filters, available in the Golden Helix SVS program, was applied to get a set of common-variant genotypes in linkage equilibrium for autosomes only.
- LD Pruning: The variant genotypes (38,877,749 in total) for all populations were first pruned by linkage disequilibrium (LD) to inactivate the first marker of a pair of markers within a 50 marker window that had a Composite Haplotype Method (CHM) R^2 value of greater than 0.5. This reduced the number of genotype calls by 12,189,892 markers, leaving a total of 26,687,857 markers.
- Autosomes: Only markers from autosomes were selected, all other markers were filtered from the genotype spreadsheet. This left 25,675,011 markers.
- MAF >= 0.3: The variant genotypes were filtered on Minor Allele Frequency; only markers with an MAF >= 0.3 were kept in the spreadsheet for the PCA calculations. This left only 493,175 markers.
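The MAF step above can be sketched in a few lines. This assumes genotypes are held as a samples-by-markers array of alternate-allele counts (0, 1, or 2) with no missing values, as the imputed calls guarantee; the NumPy representation and function name are illustrative, not the SVS internals.

```python
# Sketch of the MAF filter: compute minor allele frequency per marker
# and keep only markers with MAF at or above the threshold.
import numpy as np

def maf_filter(genotypes, threshold=0.3):
    """genotypes: samples x markers array of alt-allele counts (0, 1, 2)."""
    n_samples = genotypes.shape[0]
    alt_freq = genotypes.sum(axis=0) / (2 * n_samples)  # alternate allele frequency
    maf = np.minimum(alt_freq, 1 - alt_freq)            # fold to the minor allele
    keep = maf >= threshold
    return genotypes[:, keep], keep
```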
PCA was performed looking for the top 10 principal components using an additive genetic model, with markers normalized by the theoretical sigma at Hardy-Weinberg Equilibrium (HWE). Outliers were not removed, and components were not recomputed after the initial principal component computation. See the figure below for a scree plot of the first 10 eigenvalues from the analysis.
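For readers curious what those settings amount to, here is a minimal sketch of PCA under an additive model with HWE normalization: each marker is centered at 2p and scaled by sqrt(2p(1-p)), the theoretical standard deviation at HWE, before eigendecomposing the sample-by-sample covariance. This mirrors the settings described above but is not the SVS implementation.

```python
# Sketch: genotype PCA with additive coding and theoretical-sigma
# (HWE) normalization.
import numpy as np

def genotype_pca(genotypes, n_components=10):
    """genotypes: samples x markers array of alt-allele counts (0, 1, 2)."""
    p = genotypes.mean(axis=0) / 2.0            # alt allele frequency per marker
    sigma = np.sqrt(2.0 * p * (1.0 - p))        # theoretical sigma at HWE
    normalized = (genotypes - 2.0 * p) / np.where(sigma > 0, sigma, 1.0)
    cov = normalized @ normalized.T / normalized.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)      # returned in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]    # scree values, sample PCs
```

On data with distinct ancestry clusters, the first component of the returned eigenvectors is what gets plotted as EV1 in the scatter plots below.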
Figure 1: Scree plot of eigenvalues
The first two principal components were plotted in an XY Scatter plot in Golden Helix SVS and the data points were colored based on population group.
Figure 2: EV1 vs EV2 Colored on Population Group
By examining the plot in Figure 2, we can clearly see three major clusters of individuals. But this is also a perfect example of what admixed populations look like in PCA plots. Specifically, the data points for the samples in the ASW, PUR, CLM, and MXL populations appear to be mixtures of the most distant population clusters. These populations, not coincidentally, also happen to be the four populations in the Americas continent classification group. After these populations were hidden from the PCA plot, three clean continental groups were clearly defined.
Figure 3: EV1 vs EV2 Colored on Population – Admixture Groups Removed
Naturally, the remaining populations were grouped into a European population, an Asian population, and an African population according to these population groups:
| Continent Group | Populations | Number of Samples |
| --- | --- | --- |
| European | FIN, GBR, IBS | |
| Asian | CHB, CHS, JPT | |
| African | LWK, YRI | |
Now that we have clearly defined and validated continent groups, the next step is to create annotation tracks for population specific alternate allele frequencies. First we need to create subset spreadsheets for each continent. Initially though, these subset spreadsheets will contain the full set of variants. To get variants unique to each continent group we will count the number of alternate alleles for each variant in our population subset and filter out those variants that don’t have at least one alternate allele (i.e. all samples are the reference for that locus). The number of variants per continent group that remain are listed in the table below.
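The subsetting step described above can be sketched as follows: count alternate alleles within the continent’s samples and drop any site where no sample carries an alternate allele. The array layout and function name are illustrative assumptions, not the SVS workflow itself.

```python
# Sketch: restrict genotypes to a continent group and keep only the
# sites that are polymorphic within that group.
import numpy as np

def continent_variants(genotypes, sample_mask):
    """genotypes: samples x markers alt-allele counts; sample_mask selects the group."""
    subset = genotypes[sample_mask]
    alt_counts = subset.sum(axis=0)
    polymorphic = alt_counts > 0   # at least one alternate allele in the group
    return subset[:, polymorphic], polymorphic
```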
| Continent Group | Number of Variants |
| --- | --- |
| European | |
| Asian | |
| African | |
We can then compute the alternate allele frequencies (count of the number of alternate alleles over the total number of alleles) for each variant site in each continent group and make annotation tracks simultaneously containing the location of the variants, the reference allele, alternate alleles, and alternate allele frequencies.
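The frequency computation itself is simple: with no missing calls, every sample contributes two alleles, so the AAF is the summed alternate-allele count divided by twice the number of samples in the group. A minimal sketch, again assuming the samples-by-markers count representation:

```python
# Sketch: alternate allele frequency = alt allele count / total allele
# count; the imputed calls have no missing values, so the denominator
# is simply 2 x number of samples.
import numpy as np

def alternate_allele_frequencies(genotypes):
    """genotypes: samples x markers array of alt-allele counts (0, 1, 2)."""
    n_alleles = 2 * genotypes.shape[0]
    return genotypes.sum(axis=0) / n_alleles
```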
The names of these annotation tracks are:
These annotation tracks can be downloaded through the Golden Helix SVS Annotation Track Manager by opening Golden Helix SVS, going to Tools > Manage Annotation Tracks, and clicking on the Download from Network button. As a reminder, all of these annotation tracks are for hg19 (GRCh_37), so the filter for annotation tracks must be set to this build to have these tracks visible in the download list.
Using These Annotation Tracks
Here are just some examples of ways these annotation tracks might be used to analyze your own variant data set:
- Filter sequence data based on presence (or absence) in the 1kG Phase 1 All Sites probe track or the continent specific probe tracks.
Select > Filter by Annotation > Filter by Probe Track Membership
- Bin the variants based on the Alternate Allele Frequencies in the 1kG Phase 1 continent specific probe tracks. This can then be used for the CMC collapsing method.
Quality Assurance > Genotype > Variant Binning by Frequency Track and
Analysis > Collapsing Methods > CMC with Hotelling T Squared Test or
Analysis > Collapsing Methods > CMC with Regression.
Note: the above filtering and analysis methods require the Sequence Module for Golden Helix SVS.
With all the great public data repositories out there, it can still seem like a large hurdle to utilize those often unrefined sources in your own experiment’s analysis. At Golden Helix, we think providing public data in a pre-curated and immediately useful form, like we showed in this blog post, is just as important to meaningful analysis as the analysis methods themselves. We will continue to listen to our customers and incorporate new data releases and sources into our annotation repository. Of course, everything we did here is completely repeatable and supported within SVS, so you can just as easily create private or public annotation tracks. …And that’s my two SNPs.