Tis the season of quiet, productive hours. I’ve been spending a lot of mine thinking about file formats. Actually, I’ve been spending mine implementing a new one, but more on that later.
File formats are amazingly important in big data science. In genomics, it is hard not to be awed by how successful the BAM file format is.
I thought one of the most tweetable moments at ASHG 2013 was when Jeffrey Reid from BCM Human Genome Sequencing Center (HGSC) talked about how they offloaded to the cloud (via DNAnexus) 2.4 million hours of computing time to perform the alignment and variant calling on ~4k genomes and ~12k exomes.
In the process, they produced roughly half a petabyte of BAM files (well mostly BAM files, VCFs are an order of magnitude smaller, but part of the output mix).
I’d speculate that Heng Li‘s binary file format for storing alignments of short reads to a reference genome is responsible for more bytes of data being stored on the cloud (and maybe in general) than any other file format in the mere 4 years since it was invented.
But really, the genius of the format was not in the clever and extensible encoding of the output of alignment algorithms (the CIGAR string and key-value pair “tag” field have held up remarkably well through years of innovation and dozens of tools), but in the one-to-one relationship it shared with its text-based counterpart, the SAM file.
Wanted: Grey Beard Unix Hackers for Big Data Science
Text files have been, and will continue to be, the dominant medium of data scientists – and for good reasons. Even a basic familiarity with the Unix toolset allows for an immense amount of data management and processing tasks to take place. And to take place efficiently. I dare you to write a program that is faster at searching text files than grep.
In fact, any bioinformatic pipeline worth its salt is probably heavily utilizing grep, awk, cut, sed, sort, xargs and, yes, the rusty edged Swiss army knife that is perl. We’ve worked on a couple here at Golden Helix. You can feel the grey beard sprouting every time you supplement your power with another tool like mkfifo.
But text files are inherently just serial streams of records (lines). You have to be willing to read from start to end to find some useful piece of data.
This leads to the second innovation (or just damn good engineering choice, but why not call that an innovation) that resulted in Heng Li’s file format(s) taking over genetics: a genomic index into the (compressed) file.
While it is fine to have a 100GB compressed whole genome BAM file read linearly from start to finish if you have to do some operation over the whole file (like count the number of unique alignments or perform Indel Realignment), once the file is finished being produced, it’s likely to only be interrogated with specific regions of the genome. What do the alignments look like around this candidate variant? What’s the coverage of this exon?
Heng Li solved this by having a secondary index file (bai file) that allowed you to skip directly into the file and start reading lines that overlap a given genomic location (such as chr6:32546547-32557562). Compressed files generally don’t allow you to jump into their middle and start reading, so he also cleverly adapted a property of gzip files to create BGZF files that do allow you to seek into any offset and read. Although created for SAM/BAM, this can be generalized to any text file with genomic coordinates (with tabix).
I’ll Take My Genotypes On Stone Tablets, Please
What makes all this compressing of files and indexing them work particularly well is that big data science is largely written once, read many times in its data lifecycle. Sure, we may take three passes over a BAM to get it recalibrated, realigned etc., but we always read it all in and write it all out.
In other words, bioinformatic pipelines use a functional programming paradigm: the data you read is immutable, so simply create your own functional transformation of it. They don’t call them pipelines for nothing!
You never change a genotype in a VCF file from one thing to another by hand.
What you are likely to do is extend it with many annotations, analyze its fields to understand their distribution and statistics, and then perform a series of informed filters on many fields and annotations to find those few variants of importance.
But simply using a genomic index into a sorted stream of variant records no longer suffices to perform these tasks.
Thinking at the Speed of 1 Million Variant Histograms
If you think of a set of variants as a table with variants as the rows, you actually do most of your analysis in a column-oriented fashion.
Let’s break down the workflow of analyzing my 1 million variants of the exomes I have for myself, wife and son. I have a VCF file produced by early access to Real Time Genomics’ alignment and variant calling tools, but it could be any exome produced on any pipeline.
The first thing I might do is try to understand how I might define high-quality variants. RTG has a special score called the AVR score they use to incorporate much of their heuristics that can separate high quality from low-quality variants. Here is the histogram of that score from 0 to 0.1 (lower score equates to the lower quality):
So I can see the majority of variants fall below the 0.05 threshold (a threshold RTG suggests for distinguishing “high quality” variants) and that a score of 0.02 might remove roughly half the variants.
These types of queries and investigations are desired at each step of the process. I might also look at read depth, genotype quality, allele frequency as annotated by 1000 Genomes, NHBI 6500 exomes, dbSNP and others. Then filter and sort on things like a variant’s protein-coding effect and predicted functional effect and the genotype pattern of my family.
Without getting into the details of each annotation and filtering step, it is clear that we are operating on only a handful of “fields” at a time for each step (for example, the AVR field, the allele frequency field, my family’s genotype fields, etc). We never need to look at the entire bolus of data composing a variant record and all of its annotations for any given operation.
This distinction may not matter when dealing with hundreds or thousands of variants, but to provide a user experience that does not inhibit the creative thinking process while analyzing hundreds of thousands and millions of variants, you must optimize the way you read data.
The goal is to perform an operation like sort a table, apply a filter or show a histogram faster than it takes the user to perform the mental action of choosing their next step: roughly 1.2 seconds. Simple operations like adjusting a filter should actually take less than a mere 0.07 seconds to match the mental speed of a human performing a simple step.
If you can achieve these speeds, then the user will have the experience of creatively flowing through the software at the speed of thought. If not, you risk disrupting the cadence of that flow or worse, causing a dreaded task switch while the user waits.
As an aside: this also can make it difficult for any software not running on the user’s own hardware to achieve these speeds given any action performed through a website incurs the latency of communicating to the server and back through the internet (on average 0.1 to 0.5 seconds).
Columnar Storage: The Old New Thing
The solution to optimized field-based access of tables is simple: store your data field-wise!
Columnar storage databases are nothing new and have a number of very nice properties like the fact that your compression ratios are ridiculously good when all your data of the same field type are stored contiguously. But they are not well suited to transactional workflows that involve adding, editing, and deleting rows. Luckily, as we discussed earlier, we can embrace write-once technologies in data science.
An example of this is the recent announcement of Amazon RedShift: a cloud-based data analysis engine based on columnar storage designed to scale to petabytes of data.
Open source projects like PyTables have been providing columnar storage backends to Python’s scientific analysis platform for many big data scientific fields, including astrophysics and earth science.
Meanwhile, traditional RDBMS databases are growing the ability to use columnar storage for specific use cases like indexes on tables that don’t change frequently.
In genomics, once the data production portion of generating a variant file is complete, it makes sense to transpose the storage format from record based ones like VCF (and its in-development binary sibling BCF) to a columnar one.
May I Present: Tabular Storage Format (TSF1)
Columnar storage is also nothing new to us at Golden Helix. Since I started working on our code base 11 years ago (yes, you read that right), we have been storing data as a series of arrays ready to be read directly into the algorithms that powered our software. In fact, our first file format was not much more than that!
Believe it or not, we have gone through 3 file formats (with multiple versions) since then, and I’ve had a chance to learn about their strengths and weaknesses as our software adapted to the changing landscape of genomic analysis.
So from all that learning, and from thoughts no doubt inspired by Heng Li’s successful BAM and Tabix, there are three things that have motivated me to create a 4th file format:
- Genomic data has a mix of attribute fields (like a variant’s RSID) and matrix fields that have an entity dimension (like Genotypes and Read Depths). It’s useful to keep those matrices aligned and in sync with a view of a source.
- Columns are best stored in the longer locus dimension (i.e. one column for each sample or attribute of variants). But because this dimension can be millions of elements long, you need to segment your columns into chunks so you can read a given record efficiently.
- Compression of your data in the file format always makes sense as long as you can work with blocks of data large enough to compress. (see Why Modern CPUs are Starving)
The first point is best illustrated by thinking about my RTG VCF file with my 3 samples in it.
I’m calling fields like RSID and Ref Alleles Locus Attribute fields and Genotypes and AVR Scores Matrix fields (there is actually also one Entity Attribute field not shown. Can you guess what it is?). If I were to filter to a subset of variants, I will still want at hand all the matrix fields for those variants. If one were to flatten this to a single table (like our current file format does), you have to choose how to represent these matrices (collated by sample, or joined end-to-end, etc.).
The second point about segmenting the columns is an implementation detail that results in a couple of important properties: namely record based access can be just as fast (if not faster) and you end up with units of data for storage that lend themselves to compression.
For most types of data, zlib compression will result in very good compression ratios. But with columnar data, you can take advantage of the uniform nature of the elements in your data to do things like shuffle the bits before compression. The Blosc meta-compressor can take information about the type-size of its data and considerably outperform zlib on things like columns of sorted numbers or uniformly sized strings (i.e. indexes).
With a tabular based storage system that has excellent record-wise lookups and the ability to have extendable indexing for genomic queries (like tabix), we can actually outperform the special purpose file formats we have used in the past in nearly every possible dimension (space, speed, usage modes).
|Powers our spreadsheets. Has no concept of what is being stored in each axis.
|Whole typed single columns
|Column-only (row-wise read requires holding entire dataset in memory)
|Append column only (appending rows requires rewriting entire dataset)
|Powers our annotation tracks (currently).
|Records of strongly typed fields with a genomic coordinate indexed.
|Genomic query resulting in records only. No tabular access.
|Append records only (appending a field requires rewriting entire dataset)
|Provides genomic annotation to our spreadsheets (marker maps).
|Records of strongly typed fields with caches of genomic coordinates in columnar arrays.
|Table indexes for records, but records guaranteed to be in genomic coordinate order.
|Append records only. Append fields possible but results in a dataset rewrite.
|Annotation track (next release) and all usages in the future.
|Chunked columnar typed fields and matrices.
|Table index, field or record-wise. Genomic query with index computed.
|Append fields. Append records requires only mutating the last chunk for each field.
How Tabular Storage Format (TSF) compares to our current special case storage formats.
Armchair Diagnosis: Sir, You Have Not-Invented-Here Syndrome
It may seem strange to you that we have been inventing our own file formats for the past 15 years of Golden Helix and now I’m doing it again.
I get it. Heck, I agree with you! I have been spending the last two years scouring Hacker News, proggit, twitter, and many great technical manuals, discussion groups and papers (and another) looking for an off-the-shelf solution that fits the bill for building a local (embedded) high-performance analysis software with big-data science data sets.
I’ve looked at MonetDb (columnar store client/server db), LevelDB (scalable key-value store), Kyoto Cabinet (another key-value store), PyTables (native python array storage on HDF5), raw HDF5, SciDb (massively parallel distributed array-based computation), and any number of server-oriented solutions like Oracle Exadata, SQL Server 2012 and MySQL’s ICE.
The closest thing to winning is PyTables (which I studied closely), but it falls short by nature of its Python implementation not allowing it to be run in background threads. For a file format that needs to power our massively multi-threaded GenomeBrowse visualization, this is a show stopper.
HDF5 is the runner-up, with its ability to store strongly typed arrays in any number of dimensions and configurations. It is in fact heavily used in our industry. Affymetrix uses it in their chromosomal microarray analysis and PacBio stores their primary sequence data output in it. But HDF5 is really just a container format on top of which you would have to build a chunk-columnar store with genomic indexing.
We have been happily using SQLite as our containment format since 2005 (all the formats listed above use SQLite). SQLite is itself a powerful embedded database engine and is the most mature piece of technology I have ever had the pleasure of working with.
SQLite is an amazingly good (and recommended) choice for an application file format. It supports multi-threaded and multi-process concurrent access with all the difficulties of different locking behaviors worked out across platforms. It has useful constructs like indexes and page caching all figured out. And it is very darn efficient in its use of space.
Most importantly, it is entirely based on a single file. Yes, one of those things you can copy, move, redistribute, archive, or even email. And it contains all your meta-data, index data, columnar data and potentially multiple sources of data.
All in a file.
Flat Files on a Desktop? But, but Dude… the Cloud!
I’m a big fan of the file.
It takes zero IT intervention to use at your own will. It can be on your Mac laptop as easily as your Windows desktop. It can be shared through Dropbox, FTP, HTTP and placed on a network share.
As a company, we have built our livelihood around the idea of taking best practice methods in genomics (often available in free tools like R and PLINK) and empowering people to do analysis they couldn’t easily do themselves.
How empowering would it be to require every user of our software to have a big iron database server to handle the data storage?
How empowering would it be if storage and analysis was done on rented computers accessed over the internet – a solution requiring a business model that cannot provide a fixed cost for your analysis to put in your budget and adds a recurring charge for you data just existing?
I’m a big fan of cloud services. I use Evernote addictively. Our own community’s web-based resources like NCBI dbSNP and the UCSC Gnome Browser are foundational to the field. I even understand why a data production pipeline may benefit from the elastic scaling provided by the cloud.
But once you stretch the imagination to see a world where you can do your exome and whole genome analysis on your own device, at the speed of thought and not need more storage than the size of the compressed data you started with, it’s hard to see how there is anything more empowering than that.
Gun Slinging Cowboy Coding
So I’ve told you about the major points of TSF. But the devil is always in the details. This thing is going to take many man-years to see the light of day, right? Nope. It’s done.
And even cooler: it took only a week to write!
Well, I suppose I should be accurate and explain that it took two years of planning and preparation, many man-years of supporting infrastructure and framework first… and then a week to write. Well, and then another week to optimize and improve the genomic index, etc.
But it’s done. And… drumroll… it totally kicks ass.
Here is the size of some converted files compared to other formats:
|822MB txt.gz, 6.4GB txt
|1.5GB .txt.gz, 10GB txt
|2.6GB zip, 26.9GB txt
|Trio RTG VCF
|60.4MB vcf.gz+tbi, 225MB vcf
|My Exome GATK VCF
TSF often beats the compressed size of native text representations and has the genomic index built-in.
When will the world see it?
The first of TSF’s assignments will be to completely replace IDF2 in our upcoming Golden Helix SVS 8.1 release. Most people may only notice the 80% or more reduction in file size of the annotation files they need to annotate and filter their data, but soon we will start taking advantage of the quick tabular access, easy joining, filtering and subsetting it is capable of providing. But more on that at a later time.