Today I ran into an interesting fact about how a widely used catalog of population controls classifies African Americans, with potential impacts on research outcomes.
The 1000 Genomes Project is arguably our best common set of controls used in genomic studies.
They recently finished what was termed “Phase 1” of the project, and they have been releasing full sets of the variants they discovered in 1,092 individuals from various population backgrounds.
One of the key attributes of any population variant catalog is the frequency with which the variant allele shows up.
Having allele frequency information allows you to filter variants from your own samples. For example, researchers often do a first pass filter of their variants to only investigate “rare” variants.
If your samples are European, you often see “rare” variants defined as variants that show up in less than 1% of Europeans according to the 1000 Genomes Project.
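As a minimal sketch of that kind of first-pass filter, assume a tab-delimited VCF line whose INFO column carries a per-population frequency tag named `EUR_AF`. The helper names `parse_info` and `is_rare`, the default 1% threshold, and the choice to treat a missing tag as a frequency of zero are all illustrative assumptions, not a prescribed workflow.

```python
def parse_info(info_field):
    """Turn a VCF INFO string like 'AF=0.02;EUR_AF=0.005' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flag entries carry no value
    return info


def is_rare(vcf_line, freq_tag="EUR_AF", threshold=0.01):
    """Return True if the variant's frequency in the chosen group is below threshold.

    Assumption: a variant with no frequency tag for the group was not
    observed there, so it is treated as rare.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the eighth VCF column
    raw = info.get(freq_tag)
    if raw is None or raw is True:
        return True
    # Multi-allelic sites may list comma-separated values; use the largest.
    return max(float(x) for x in raw.split(",")) < threshold


# Hypothetical record, made up for illustration only.
example = "1\t100\trs0\tA\tG\t100\tPASS\tAF=0.02;EUR_AF=0.005;AFR_AF=0.07"
print(is_rare(example))                     # rare in Europeans
print(is_rare(example, freq_tag="AFR_AF"))  # not rare in the African group
```

The same variant can pass or fail the filter depending on which group's frequency you pick, which is why the definition of each group matters.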
But those 1,092 individuals are from 14 different population groups. Some of these groups have well-defined ancestries and some of them do not.
So how the heck do you define individuals as European, African, or Asian?
Thankfully, it’s not by asking the individuals or guessing based on where they are from.
At the time, we saw we could define European, African, and Asian based on the very tight clustering of sub-populations with common ancestral background. We used these groups to build population-specific allele frequency tracks for our genome browser and filtering workflows.
So today, I was going through the VCF header of the third release of the Phase 1 variant catalog, and I noticed they had already broken out frequencies by the groups EUR, AFR, ASN, and AMR.
Great! They did the hard work for us this time!
So to see how they got to these population groups, I took a look at the sample manifest.
Here is the number of samples per group:
The AMR group contains all the admixed individuals from various groups in the Americas, such as Mexicans from Los Angeles and Puerto Ricans.
Well, all was looking good, but they definitely have more samples in the African group than we found.
Why is that?
It turns out they included the 61 individuals classified as African Ancestry in Southwest US (ASW) in the African group.
As you can see in our PCA plot, this group clearly has a mixed ancestry and hence has individuals that stretch between the groups represented by the African and European clusters.
Well yeah, that’s probably not a big surprise.
So why did they put them in the African group for the genetic research community to treat as a filter of African controls?
Will this dramatically impact the results of future research?
Probably not. But only because the ASW group contributes 61 of the 246 African individuals (roughly a quarter) and will hence not dominate the allele frequencies.
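To put a rough number on that, here is a small sketch that treats the group frequency as a sample-size-weighted average of the sub-population frequencies. It assumes every individual contributes the same number of called alleles, and the 10% ASW frequency is a made-up value for illustration; only the counts 61 and 246 come from the manifest.

```python
def pooled_frequency(groups):
    """Sample-size-weighted mean allele frequency.

    groups: list of (n_individuals, allele_frequency) pairs.
    Assumes equal numbers of called alleles per individual.
    """
    total = sum(n for n, _ in groups)
    return sum(n * f for n, f in groups) / total


ASW_N = 61
OTHER_AFR_N = 246 - ASW_N  # the remaining individuals in the African group

# Hypothetical variant: 10% in ASW, absent from the rest of the group.
pooled = pooled_frequency([(ASW_N, 0.10), (OTHER_AFR_N, 0.0)])
print(round(pooled, 4))  # 0.0248
```

In this made-up case the ASW signal is diluted to about a quarter of its own frequency, so it does not dominate the group, but it is still enough to move a variant across a 1% cutoff.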
Personally, I think it’s a bit sloppy.
But at the end of the day, what will matter more to us as a community is that we are talking about the same thing when we say “I filtered against the African allele frequency from the 1000 Genomes Phase 1 data.”
And for that reason, I’m going with it.