Dr. Bryce Christensen recently gave a webcast, Maximizing Public Data Sources for Sequencing and GWAS Studies. He covered options for getting GWAS and sequence information online; tips for working with these datasets, including what you’ll see in terms of data quality and usefulness; how to use public data sources in conjunction with your own GWAS or sequence study (and how NOT to); and the data management and manipulation features in SNP & Variation Suite that help you use online databases more effectively. In this blog post, I’ll summarize his suggestions for how to use public data effectively.
It is common knowledge that there is a wealth of public data available to researchers: the NCBI, EGA, HapMap Project, 1000 Genomes Project, GAW, and more. Plus, there’s data that can be obtained from hardware vendors, software vendors such as Golden Helix, and even individual research labs that make data available on their websites.
The question becomes: when is it appropriate to use public data, and in what context? These are valid concerns, as many researchers have applied public data with a hacksaw methodology that produces questionable findings. Yet, when used correctly, public data can serve as a reference that gives additional insight into your study, it can increase your dataset size, and it can help you test new methods or software.
Use as a reference
The first way to use public data is as reference samples for assessing population structure in GWAS. Below you can see HapMap data used to anchor a Principal Component Analysis (PCA) and give an idea of the ethnic structure of the study samples.
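To make the idea of "anchoring" concrete, here is a minimal sketch in plain Python. It assumes genotypes coded as 0/1/2 allele counts, computes only the first principal axis from a reference panel, and then projects study samples onto that axis. The genotypes are invented toy values (not HapMap data), the function names are hypothetical, and this is not how SNP & Variation Suite implements PCA. A real analysis would use many more markers, several components, and a dedicated tool.

```python
import math

def column_means(rows):
    """Mean allele count per marker, computed from the reference panel only."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]

def center(rows, means):
    """Subtract the reference-panel marker means from every sample."""
    return [[value - m for value, m in zip(row, means)] for row in rows]

def leading_axis(centered, iterations=100):
    """Estimate the first principal axis by power iteration on X^T X.

    The uniform starting vector is fine for this toy data; a random
    start is the more usual choice in practice.
    """
    n_markers = len(centered[0])
    axis = [1.0 / math.sqrt(n_markers)] * n_markers
    for _ in range(iterations):
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        updated = [sum(s * row[j] for s, row in zip(scores, centered))
                   for j in range(n_markers)]
        norm = math.sqrt(sum(u * u for u in updated))
        axis = [u / norm for u in updated]
    return axis

def pc1_scores(rows, means, axis):
    """Project samples onto the axis learned from the reference panel."""
    return [sum(x * a for x, a in zip(row, axis)) for row in center(rows, means)]

# Toy reference panel: three samples from each of two invented populations.
reference = [
    [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0],  # population A
    [2, 2, 1, 2, 2, 2], [2, 1, 2, 2, 2, 1], [1, 2, 2, 2, 1, 2],  # population B
]
study = [[0, 0, 0, 1, 0, 0], [2, 2, 2, 1, 2, 2]]  # invented study samples

means = column_means(reference)
axis = leading_axis(center(reference, means))
print("reference PC1:", pc1_scores(reference, means, axis))
print("study PC1:    ", pc1_scores(study, means, axis))
```

The design point is that the marker means and the axis come from the reference panel alone, so the study samples land in a coordinate system defined by populations of known ancestry.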
Another very popular and noble way to use public data is to replicate the results of your own GWAS or other research. Your data might give very encouraging results, and these findings can then be “checked” against data from sources such as dbGaP to see whether they hold up in an external population with the same phenotype.
For those venturing into next-generation sequencing, public data can also be used to annotate sequence variants to gain a broader picture of what is happening with your data. This is especially important as research moves into the clinic.
Increase your dataset
Another advantageous use of public data is to combine datasets for meta-analysis or mega-analysis. A researcher’s GWAS can be combined with, for example, a study from dbGaP, another study from EGA, and even a study from a friend doing research two states away. Using many different datasets drastically increases the sample size, which results in greater statistical power.
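As a rough illustration of why pooling studies adds power, here is a small sketch of a fixed-effect, inverse-variance-weighted meta-analysis for a single SNP. This is one standard way to combine summary statistics, not necessarily the method discussed in the webcast. The effect sizes and standard errors are invented, and the function name is hypothetical.

```python
import math

def fixed_effect_meta(estimates):
    """Combine per-study (beta, standard_error) pairs for one SNP.

    Each study is weighted by 1 / se^2, so more precise studies count more.
    Returns the pooled beta, pooled standard error, z-score, and a
    two-sided p-value from the normal approximation.
    """
    weights = [1.0 / (se * se) for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value

# Invented summary statistics: your own GWAS plus three public studies.
own_study = [(0.10, 0.05)]
all_studies = own_study + [(0.12, 0.05), (0.08, 0.05), (0.10, 0.05)]

print("own study only:", fixed_effect_meta(own_study))
print("four studies:  ", fixed_effect_meta(all_studies))
```

With four equally precise studies, the pooled standard error is half that of a single study, so the same effect size gives a much stronger signal. This sketch assumes the studies measure the same effect on the same scale, which is exactly the kind of assumption to verify before combining public datasets.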
Genotype imputation is a common and popular option for expanding the knowledge we get from SNP genotyping arrays. Public data is usually the only option available to a researcher who needs a large population for a reference panel for imputing SNPs.
And finally, a warning: public controls should be used in the absence of your own controls only as a last resort, as this practice can lead to skewed results (see blog post on Experimental Design).
Test new methods or software
If you are involved in method development and have new analytic methods you want to test, public data is a great option. Many public datasets have published results against which you can compare the output of a new method.
Researchers sometimes have a brief time span where they can evaluate different software tools before their data actually arrives. Often, it is beneficial during this time period to use public data to evaluate products so that when the new data does arrive, a researcher can hit the ground running.
Proceed with caution
It is important to note that just because public data has already been processed does not mean that corners can be cut on quality control; public data should always be treated as fresh, unprocessed data.
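As an example of what treating public data as fresh data can look like, the sketch below applies two common marker-level checks, call rate and minor allele frequency, to a downloaded genotype table. The 0/1/2 coding with None for missing calls, the thresholds, the marker names, and the function names are all assumptions made for illustration. Appropriate filters and cutoffs depend on your study.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_frequency = sum(called) / (2 * len(called))
    return min(alt_frequency, 1.0 - alt_frequency)

def passing_markers(markers, min_call_rate=0.95, min_maf=0.01):
    """Return the names of markers that pass both illustrative thresholds."""
    return [
        name for name, genotypes in markers.items()
        if call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    ]

# Invented genotypes for ten samples at three hypothetical markers.
public_markers = {
    "snp_ok": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],
    "snp_missing": [0, None, None, 1, 2, None, 1, 0, 1, None],
    "snp_monomorphic": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

print(passing_markers(public_markers))
```

Running the same checks on public data that you would run on your own genotypes makes it easier to spot markers that were filtered differently, or not at all, by the original study.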
Public data is widely available and can be used in a variety of ways. However a researcher plans to implement public data, it is important to remember: if you know what you’re looking for, quite often, you can find it.