The Golden Helix sales team recently came to me for recommendations regarding best practices for incorporating public controls in SNP GWAS. It seems that there has been a surge of questions regarding this practice over the past few weeks from our customers. Initially, I laughed at the irony of being asked to outline the best practices for what I see as an inherently problematic practice. Golden Helix has long been vocal about promoting good experimental design and warning about the dangers of batch effects (see Christophe Lambert’s post). Combining data from multiple sources is a guaranteed way to introduce batch effects into an analysis project. When some or all of that data comes from public data repositories for which you have minimal control over the data processing and no control of the sample handling and lab work, the problems are greatly compounded.
Despite the inherent difficulties, I recognize the great utility that comes from using public data. The statistical power of an analysis can be increased by including additional samples, and the validity of a study is enhanced if the results can be replicated in a second data set. Public data is very useful for developing and testing new analysis methods and it empowers researchers with limited funding opportunities to do high-impact research. Public data is powerful, but comes with a responsibility to use it wisely. So with these warnings in mind, I will share a few thoughts for wise use of public SNP data. We’ll start by looking at some ways that public data can be used, and then discuss a few guidelines for how to use it appropriately.
Some common uses for public data
- Testing new analytical methods: This is a fabulous use for public data. Anybody involved in methodology or algorithm development can download data from dbGaP, GEO, or other public databases and compare their new methods to conventional methods and/or published results for that study. Another resource is the Genetic Analysis Workshop (GAW), which sponsors a biennial conference about contemporary problems in the analysis of genetic data. GAW creates extensive simulated data sets for each conference which may be requested by researchers for testing the accuracy and power of new methods.
- Reference samples for assessing population stratification: This is another great way to use public data. HapMap samples are particularly useful as anchor populations in principal components analysis (PCA). I recommend trying to find reference genotypes that match the array you used for your own data. Data for the HapMap phase I/II samples are available for almost all of the popular platforms, and phase III genotypes are available for some of the newer arrays. Contact your array vendor if you need help finding that data.
- Reference data for SNP imputation: Public data is usually the only resource available for a large-scale imputation reference. HapMap and the 1000 Genomes Project both have great value in this regard. Imputation is a complicated process, and several blog entries could be devoted to the best practices. For now, I will say that our experience suggests when using imputation to harmonize SNP data from multiple arrays, it works best to impute each data set individually against a common reference (HapMap or 1kG) before combining them.
- Replicating results of your own GWAS: This is a relatively safe use for public SNP data. Assuming that appropriate replication data exists in the public domain, a simple validation study does not require merging data sets, and the risk of results being influenced by batch differences and experimental artifacts is therefore reduced. Don’t assume that you can cut corners because somebody has already processed the data. Start from the raw data and do a thorough analysis consistent with the procedures used for your own GWAS.
- Mega analysis: Mega analysis is the process of combining the genotype data from multiple GWAS into one big study with presumably greater statistical power. If you have two or more case-control studies with equal proportions of cases and controls, and the experimental design was properly randomized, then you have ideal conditions for mega analysis. Otherwise, be prepared to find major experimental artifacts in the analysis. You should be very, very careful.
- Using public controls when you only have case genotypes available: Do this only as a last resort! I must confess to having done this myself in grad school, but it is very problematic. You will have spurious associations in a GWAS that uses only public controls. If you have no other choice than to use public controls, be very careful to ensure appropriate matching of ethnicity, age, gender, and other important covariates. Use control samples genotyped with the exact same array as your cases and use as many controls as possible. Be prepared for some tough questions in peer review.
Some tips for appropriate use of public data
It’s impossible to be too careful when using public data, especially when combining data from multiple sources. Most public data sources have documentation about the study design and experimental protocols, but you should take nothing for granted. You can’t assume that phenotypes are correct, nor can you assume that genotypes are always correct. In our own experience with public data, we have found numerous examples where the reported gender for subjects did not match the genotype data, indicating the possibility of even more errors in phenotype annotation. We have also found several instances of public studies that used the dual-array Affymetrix 500k platform where the two arrays were clearly mismatched for several subjects. When using public genotype data, you need to use strict quality control (QC) methods. Most studies on dbGaP have some documentation for how the study was designed and typically include the raw data as well as a “cleaned” subset of the data. It is important to check the documentation to find out if there are known issues with the dataset, but I generally suggest starting from the raw data and doing your own QC rather than using the cleaned data.
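As a rough illustration of the gender check mentioned above, here is a minimal sketch that compares reported sex against X-chromosome heterozygosity. It relies on the fact that males carry a single X and should show almost no heterozygous calls outside the pseudoautosomal regions. The function names, the data layout, and the 0.05 and 0.15 cutoffs are my own assumptions for illustration, not validated thresholds or part of any particular tool.

```python
# Genotypes are coded as minor-allele counts (0, 1, 2); None marks a missing call.
# The cutoffs below are illustrative assumptions and should be tuned per data set.

def x_heterozygosity(x_genotypes):
    """Fraction of non-missing X-chromosome calls that are heterozygous."""
    called = [g for g in x_genotypes if g is not None]
    if not called:
        return None
    return sum(1 for g in called if g == 1) / len(called)


def infer_sex(x_genotypes, male_max=0.05, female_min=0.15):
    """Guess sex from X heterozygosity; return 'M', 'F', 'ambiguous' or 'unknown'."""
    het = x_heterozygosity(x_genotypes)
    if het is None:
        return "unknown"
    if het <= male_max:
        return "M"
    if het >= female_min:
        return "F"
    return "ambiguous"


def sex_mismatches(samples):
    """samples: dict of sample_id -> (reported_sex, x_genotypes).

    Returns a dict of sample_id -> (reported, inferred) for every disagreement.
    """
    flagged = {}
    for sample_id, (reported, genotypes) in samples.items():
        inferred = infer_sex(genotypes)
        if inferred != reported:
            flagged[sample_id] = (reported, inferred)
    return flagged


# Toy data: S3 is reported female but has no heterozygous X calls.
samples = {
    "S1": ("M", [0, 2, 0, 2, 2, 0, 0, 2, 0, 2]),
    "S2": ("F", [0, 1, 2, 1, 0, 1, 2, 0, 1, 0]),
    "S3": ("F", [2, 0, 0, 2, 2, 0, 2, 0, 0, 2]),
}
mismatches = sex_mismatches(samples)
print(mismatches)
```

Flagged samples are worth a second look rather than automatic removal; a mismatch may point to a sample swap, which would put the rest of that subject's phenotype data in doubt as well.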
Mega analysis requires special attention in both QC and association testing. When combining data sources, I recommend that you process the data for each study using a standard protocol. This means calling the genotypes using the same algorithm, following the same QC procedures, and applying the same filters. If a SNP fails QC in any one of the studies, consider removing that SNP from the overall analysis as well. Be aware of strand matching; Illumina’s “Top” strand is not the same as Affy’s “Top” strand. When possible, try to only use studies based on the same genotyping array, or at least arrays from the same manufacturer. We have seen examples of spurious associations that resulted from two different array scanners being used in the same lab for one study. If spurious associations can arise from using different scanners in an otherwise homogeneous process, they can certainly arise from using different arrays from different manufacturers processed at different labs. A re-calling algorithm like BEAGLECALL can help to smooth out the rough edges in some cases, but not all. Depending on your study design, you may want to use imputation to combine data from mismatched arrays. Most imputation programs have some sort of genotype probability or quality scoring method; be sure that you use it.
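To make the SNP filtering and strand matching points concrete, here is a small sketch that keeps only SNPs passing QC in both studies and checks whether the alleles agree directly or after a strand flip. Dropping A/T and C/G SNPs, whose strand cannot be determined from the alleles alone, is a conservative choice I am assuming here, not a rule stated above. The function names and data layout are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(alleles):
    """True for A/T and C/G SNPs, which look identical on either strand."""
    first, second = alleles
    return COMPLEMENT[first] == second


def flip(alleles):
    """Return the alleles as they would be reported on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def harmonize(study_a, study_b):
    """Compare two studies' QC-passing SNPs.

    study_a, study_b: dict of snp_id -> (allele1, allele2), containing only
    the SNPs that passed QC in that study.
    Returns (kept, dropped): kept maps snp_id -> 'same' or 'flipped',
    dropped maps snp_id -> the reason it was excluded.
    """
    kept, dropped = {}, {}
    for snp in sorted(set(study_a) | set(study_b)):
        if snp not in study_a or snp not in study_b:
            dropped[snp] = "failed QC or absent in one study"
            continue
        a, b = study_a[snp], study_b[snp]
        if is_strand_ambiguous(a):
            dropped[snp] = "strand-ambiguous (A/T or C/G)"
        elif set(a) == set(b):
            kept[snp] = "same"
        elif set(flip(a)) == set(b):
            kept[snp] = "flipped"
        else:
            dropped[snp] = "alleles do not match"
    return kept, dropped


# Toy data: rs2 is on opposite strands, rs3 is ambiguous, rs4/rs5 are unshared.
study_a = {"rs1": ("A", "G"), "rs2": ("A", "C"), "rs3": ("A", "T"), "rs4": ("C", "T")}
study_b = {"rs1": ("A", "G"), "rs2": ("T", "G"), "rs3": ("A", "T"), "rs5": ("C", "T")}
kept, dropped = harmonize(study_a, study_b)
print(kept)
print(dropped)
```

SNPs marked 'flipped' would need their genotypes recoded to a common strand before merging. In practice you would also want to compare allele frequencies between studies, since matching allele labels alone cannot catch every strand problem.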
It is absolutely vital to test for population stratification and for consistency between the various cohorts in mega analysis. PCA is helpful for identifying outlier subjects and confirming that the cohorts have similar structure. You might also do an association test comparing cohorts to each other and check for inflation of the genomic control lambda statistic. A general rule is that if lambda>1.1, the two cohorts may have dissimilar population structure and you should consider trimming out influential subjects or applying some sort of correction to your analysis (e.g. PCA or genomic control). It is always good practice in mega analysis to include a variable for the cohort and thereby correct for effects that may be due simply to experimental differences. This will be most useful if each cohort in the mega analysis includes both cases and controls. The reason why it is so dangerous to use public controls for testing associations with your own group of cases is the difficulty in controlling for these experimental effects.
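The lambda check described above can be sketched in a few lines. Genomic control lambda is commonly computed as the median of the observed 1-degree-of-freedom chi-square statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.455). The function names are mine, and the toy statistics are made up purely to show the arithmetic.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Ratio of the observed median test statistic to its expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def cohorts_look_dissimilar(chi2_stats, threshold=1.1):
    """Apply the lambda > 1.1 rule of thumb to a cohort-vs-cohort association test."""
    return genomic_control_lambda(chi2_stats) > threshold


# Toy 1-df chi-square statistics from a hypothetical cohort-vs-cohort test.
null_like = [0.1, 0.3, 0.4549364, 0.9, 2.0]
inflated = [0.2, 0.6, 0.91, 1.5, 3.0]

lambda_null = genomic_control_lambda(null_like)
lambda_inflated = genomic_control_lambda(inflated)
print(round(lambda_null, 3), cohorts_look_dissimilar(null_like))
print(round(lambda_inflated, 3), cohorts_look_dissimilar(inflated))
```

A real test would use statistics from hundreds of thousands of SNPs, not five. The point is only that lambda is a simple ratio of medians, so it is cheap to compute for every pair of cohorts before you merge anything.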
Finally, as with any study, it is important to do post-analysis QC work to confirm the validity of any significant results. Check for possible confounding variables. Check the allele distributions for significant SNPs for cases and controls from each cohort. If something doesn’t seem right, it’s probably not. If raw data is available, check the cluster plots for the SNPs and make sure that the genotype calls make sense. Genotype calling algorithms can be especially inconsistent where rare minor alleles are concerned, and it is easier to check the validity of your results first than to see them fail in replications later.
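One way to do the allele distribution check is to tabulate the frequency of the tested allele separately for cases and controls within each cohort. The sketch below assumes a simple record layout of (cohort, status, genotype) that I made up for illustration. If the association is driven by a single cohort or a single batch, it tends to show up quickly in a table like this.

```python
from collections import defaultdict


def allele_frequencies(records):
    """Frequency of the tested allele per (cohort, status) group.

    records: iterable of (cohort, status, genotype), where genotype is the
    count of the tested allele (0, 1, 2) or None for a missing call.
    """
    allele_count = defaultdict(int)
    chromosome_count = defaultdict(int)
    for cohort, status, genotype in records:
        if genotype is None:
            continue  # skip missing calls rather than counting them as 0
        allele_count[(cohort, status)] += genotype
        chromosome_count[(cohort, status)] += 2
    return {
        group: allele_count[group] / chromosome_count[group]
        for group in chromosome_count
    }


# Toy data for one significant SNP in two hypothetical cohorts.
records = [
    ("A", "case", 1), ("A", "case", 2),
    ("A", "control", 0), ("A", "control", 1),
    ("B", "case", 0), ("B", "case", None),
    ("B", "control", 0), ("B", "control", 0),
]
freqs = allele_frequencies(records)
for group in sorted(freqs):
    print(group, freqs[group])
```

In this toy example the case/control difference exists only in cohort A, which is exactly the kind of pattern that should send you back to the cluster plots before you believe the result.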
There are definitely advantages to using public data, but there are also many dangers you must take into consideration. Again, it’s impossible to be too careful when using public data, so take your time and be very thorough. … And that’s my two SNPs.