“Who has ever had a bad experience with a VCF file?” I like to put that question to the audience when I present data analysis workshops for Golden Helix. The question invariably draws laughter as many people raise their hands in the affirmative. It seems that just about everybody who has ever worked with VCF files has encountered some sort of difficulty with the format’s complexity or practical limitations. Among other responsibilities at Golden Helix, I give frequent demonstrations of our SVS software and train people to use it. In this capacity I have observed many of the challenges people encounter with VCF files, and I’d like to discuss some of those issues here.
What’s the challenge?
The VCF format is defined by strict guidelines that have evolved over time. Version 4.1 has been the standard for quite some time now. Version 4.2, which includes optimizations for representing structural variants, was released late last year. The VCF format is what I would call a hybrid format, somewhere between a human-readable text file and a machine-readable data file. It is incredibly flexible—it can be used to represent phased or unphased SNVs, indels, or SVs from one or several DNA samples with or without user-defined annotation data. This flexibility is one reason for the format’s complexity. To quote from the official description of the VCFv4.1 format, “VCF is very expressive, accommodates multiple samples, and is widely used in the community. Its biggest drawback is that it is big and slow.” So even the people in charge understand that it’s not a perfect format.
Taking liberties with the format
Adhering to the format conventions is absolutely vital so that VCF files can be read and handled correctly by software like SVS and GenomeBrowse. We sometimes get tech support emails from people who say that they got a VCF file from their bioinformatics core, and our software won’t read their file. Further investigation usually reveals that the VCF file doesn’t conform to the specification. The header data will be modified, there will be non-standard annotation formats, or there will be other modifications intended to make the file easier for humans to read or to view in Excel. Sometimes it won’t be a VCF file at all, but the output from a variant annotation program, and the client was told that it was a VCF. More and more software tools are using compressed VCF files paired with a tabix index file. This format is better protected against human meddling and also speeds up processing.
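To make the most common header problems concrete, here is a minimal Python sketch of a structural check. It is my own illustration, not the validation that SVS or GenomeBrowse performs, and the function name `check_vcf_header` is hypothetical. It only looks at three things the specification requires: a `##fileformat=VCF...` first line, a tab-delimited `#CHROM` column header with the eight fixed columns, and a `FORMAT` column whenever sample columns are present.

```python
# The eight fixed columns required by the VCF 4.x specification.
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_header(lines):
    """Return a list of header problems for a VCF given as an iterable of text lines.

    Hypothetical helper for illustration only; it checks the header layout,
    not the data records or the INFO/FORMAT definitions.
    """
    problems = []
    saw_column_line = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if number == 1 and not line.startswith("##fileformat=VCF"):
            problems.append("line 1 is not a ##fileformat=VCF declaration")
        if line.startswith("##"):
            continue  # meta-information line, keep scanning
        # The first line that is not meta-information must be the column header.
        saw_column_line = line.startswith("#CHROM")
        if saw_column_line:
            fields = line.split("\t")
            if fields[:8] != REQUIRED_COLUMNS:
                problems.append("column header is not tab-delimited with the eight fixed columns")
            elif len(fields) > 8 and fields[8] != "FORMAT":
                problems.append("ninth column should be FORMAT when sample columns are present")
        break
    if not saw_column_line:
        problems.append("no #CHROM column header line found before the data")
    return problems
```

An empty list means the header passed these few checks, which is far short of full conformance. A file that has been round-tripped through a spreadsheet typically fails on the first or second check. Any iterable of lines works as input, including an open text file handle.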
When is a multi-sample VCF more appropriate?
Many sequencing labs and bioinformatics cores have a standard processing pipeline that produces one VCF file per sample. This is all fine and good if each sample is a self-contained research project, but that’s usually not the case. This is one of the most common workflow problems that I see among our users. The problem here is that VCF files only capture information about loci where there is sufficient evidence in the sequence alignment to make a variant call. You might assume that anything not listed in the VCF file is simply homozygous for the reference base, and you would very often be wrong. There are many reasons why a variant allele will be absent from a VCF file, including low coverage or low sequencing quality. You simply can’t assume that the absence of data indicates a reference genotype. If your study design requires certainty about homozygous-reference genotypes, you will find that a multi-sample VCF file is more appropriate. Multi-sample VCF files capture important data, like read depth and genotype quality, for each sample in a cohort at every locus where at least one sample has a variant.
A common example of this is with family trios, where a researcher is interested in finding de novo mutations. I’ll use some real data to illustrate: I have whole-exome sequence data for a family trio. Variants were first called individually and annotated in three separate VCF files. Based on this data, there are 2171 coding variants called with GQ>20 in the child, but no data at all in the VCF files of either parent. No, this isn’t a case of non-paternity; it’s a case of using the wrong experimental design. No more than a handful of these are likely to be true de novo mutations, but I simply don’t have the right information to identify them here. With three individual VCFs, the only de novo variants you can be certain of are those where the parents are both homozygous for a non-reference allele, and the child carries the reference or a different alternate than the parents. Such cases are almost always sequencing errors.
I can do much, much better if I take the same trio data and run the BAM files through a variant caller that produces a multi-sample VCF. There are just 337 coding variants called in the child with GQ>20, where the parents are also called homozygous reference with similar quality. This is still too many to believe, but the multi-sample format also gives me allele depths and other useful information to sort it out. If I further require that no more than 5% of each parent’s reads carry an alternate allele, and that at least 25% of the child’s reads support the alternate, then I’m left with just 5 candidate variants to review (and the alignments of those 5 all look fantastic in GenomeBrowse). Again, you can’t do that kind of work with 3 individual VCF files.
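The filter described above can be sketched in a few lines of Python. This is an illustration of the logic, not the workflow I actually ran in SVS. It assumes a tab-delimited multi-sample VCF record whose FORMAT column includes `GT`, `AD`, and `GQ`, with the samples in child, father, mother order. It also interprets "similar quality" for the parents as GQ>20, which is my assumption. The function names are hypothetical.

```python
def parse_sample(format_keys, sample_field):
    """Map FORMAT keys (e.g. GT:AD:GQ) onto one sample's colon-separated values."""
    return dict(zip(format_keys, sample_field.split(":")))


def alt_fraction(ad):
    """Fraction of reads supporting any non-reference allele, from an AD string like '28,12'."""
    depths = [int(x) for x in ad.split(",")]
    total = sum(depths)
    return (total - depths[0]) / total if total else None


def carries_alt(gt):
    """True if the genotype contains at least one called non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def is_denovo_candidate(line, child=0, father=1, mother=2,
                        min_gq=20, max_parent_alt=0.05, min_child_alt=0.25):
    """Apply the trio filter to one VCF data line; sample order is an assumption."""
    fields = line.rstrip("\r\n").split("\t")
    keys = fields[8].split(":")
    samples = [parse_sample(keys, s) for s in fields[9:]]
    kid = samples[child]
    try:
        # Child: variant genotype, GQ above threshold, enough alternate reads.
        if not carries_alt(kid["GT"]) or int(kid["GQ"]) <= min_gq:
            return False
        kid_frac = alt_fraction(kid["AD"])
        if kid_frac is None or kid_frac < min_child_alt:
            return False
        # Parents: confidently homozygous reference with almost no alternate reads.
        for parent in (samples[father], samples[mother]):
            if parent["GT"] not in ("0/0", "0|0") or int(parent["GQ"]) <= min_gq:
                return False
            frac = alt_fraction(parent["AD"])
            if frac is None or frac > max_parent_alt:
                return False
    except (KeyError, ValueError):
        return False  # missing fields or '.' values cannot support a call
    return True
```

Note that a record with missing parental data is rejected rather than treated as homozygous reference, which is the whole point of using the multi-sample file. Candidates that pass still deserve a look at the alignments.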
Another place where I have seen this issue arise is with cancer data. This topic deserves its own blog post, but for now, let’s just say that somatic mutations share similarities with de novo mutations. You can’t make somatic mutation calls by comparing VCF files from your tumor and normal samples—it requires comparing the BAM files directly, preferably with a variant caller designed for somatic mutations.
What about gVCF?
Genome VCF, or “gVCF”, is an extension of the VCF format that captures information about the intervals between variants. So if there are two variants separated by 300bp, the gVCF file will include a record indicating that no variants are observed in that 300bp region, together with the minimum depth and quality information observed over the interval. gVCF is inherently a single-sample format, and allows for multiple samples to be compared somewhat directly without the necessity of creating a multi-sample VCF. I have heard that the Broad Institute recently adopted gVCF files as standard protocol for their internal analysis pipelines. I don’t have much direct experience with gVCF, and it is not yet supported by SVS. Intuitively, I would expect it is a good option with whole genomes and targeted applications where there is consistent sequencing coverage throughout the target regions, but could have more limitations when dealing with variable coverage data like exomes. I’m hoping to have more time to work with the gVCF format soon.
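As a rough illustration of how those interval records can be used, the Python sketch below finds the record that covers a given position. It assumes the convention used by common gVCF producers, where a non-variant block stores its last base in an `END=` entry of the INFO column. The function names are hypothetical, and the names of the depth and quality fields differ between callers, so the sketch returns the whole record rather than guessing at them.

```python
def block_interval(line):
    """Return (chrom, start, end) covered by one gVCF record, 1-based and inclusive."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
    end = pos + len(ref) - 1  # default: the record spans its REF allele
    for entry in info.split(";"):
        if entry.startswith("END="):  # assumed gVCF block convention
            end = int(entry[4:])
    return chrom, pos, end


def covering_record(lines, chrom, position):
    """Return the first data line whose interval contains the position, or None."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        c, start, end = block_interval(line)
        if c == chrom and start <= position <= end:
            return line.rstrip("\r\n")
    return None
```

If `covering_record` returns a non-variant block, its depth and quality values tell you how much confidence to place in a reference genotype at that position. If it returns `None`, the position was not assessed at all, which is different from being reference.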
I’d like to leave you with a few thoughts:
- If you are in the business of generating VCF files, please, please stick to the format conventions.
- If you are a recipient of VCF files, please try to avoid editing them, or at least keep an original copy around that will work with software that is expecting standard formats.
- If you are an investigator just getting started with NGS data analysis, don’t assume that the standard deliverable from your bioinformatics resource is right for your study. Talk to them about your study design and analysis goals and make sure to get the right VCF format for your needs.
And that’s my 2 SNPs.