All I Want for Christmas is a New File Format for Genomics

Tis the season of quiet, productive hours. I’ve been spending a lot of mine thinking about file formats. Actually, I’ve been spending mine implementing a new one, but more on that later.

File formats are amazingly important in big data science. In genomics, it is hard not to be awed by how successful the BAM file format is.

Heng Li thinking up another algorithm…

I thought one of the most tweetable moments at ASHG 2013 was when Jeffrey Reid from BCM Human Genome Sequencing Center (HGSC) talked about how they offloaded to the cloud (via DNAnexus) 2.4 million hours of computing time to perform the alignment and variant calling on ~4k genomes and ~12k exomes.

In the process, they produced roughly half a petabyte of BAM files (well mostly BAM files, VCFs are an order of magnitude smaller, but part of the output mix).

I’d speculate that Heng Li‘s binary file format for storing alignments of short reads to a reference genome is responsible for more bytes of data being stored on the cloud (and maybe in general) than any other file format in the mere 4 years since it was invented.

But really, the genius of the format was not in the clever and extensible encoding of the output of alignment algorithms (the CIGAR string and key-value pair “tag” field have held up remarkably well through years of innovation and dozens of tools), but in the one-to-one relationship it shared with its text-based counterpart, the SAM file.

Wanted: Grey Beard Unix Hackers for Big Data Science
Text files have been, and will continue to be, the dominant medium of data scientists – and for good reasons. Even a basic familiarity with the Unix toolset allows for an immense amount of data management and processing tasks to take place. And to take place efficiently. I dare you to write a program that is faster at searching text files than grep.

In fact, any bioinformatic pipeline worth its salt is probably heavily utilizing grep, awk, cut, sed, sort, xargs and, yes, the rusty edged Swiss army knife that is perl. We’ve worked on a couple here at Golden Helix. You can feel the grey beard sprouting every time you supplement your power with another tool like mkfifo.

But text files are inherently just serial streams of records (lines). You have to be willing to read from start to end to find some useful piece of data.

The fathers of Unix, Dennis Ritchie, and Ken Thompson sport the original greybeards

This leads to the second innovation (or just damn good engineering choice, but why not call that an innovation) that resulted in Heng Li’s file format(s) taking over genetics: a genomic index into the (compressed) file.

While it is fine to have a 100GB compressed whole genome BAM file read linearly from start to finish if you have to do some operation over the whole file (like count the number of unique alignments or perform Indel Realignment), once the file is finished being produced, it’s likely to only be interrogated with specific regions of the genome. What do the alignments look like around this candidate variant? What’s the coverage of this exon?

Heng Li solved this by having a secondary index file (bai file) that allowed you to skip directly into the file and start reading lines that overlap a given genomic location (such as chr6:32546547-32557562). Compressed files generally don’t allow you to jump into their middle and start reading, so he also cleverly adapted a property of gzip files to create BGZF files that do allow you to seek into any offset and read. Although created for SAM/BAM, this can be generalized to any text file with genomic coordinates (with tabix).

I’ll Take My Genotypes On Stone Tablets, Please
What makes all this compressing of files and indexing them work particularly well is that big data science is largely written once, read many times in its data lifecycle. Sure, we may take three passes over a BAM to get it recalibrated, realigned etc., but we always read it all in and write it all out.

In other words, bioinformatic pipelines use a functional programming paradigm: the data you read is immutable, so simply create your own functional transformation of it. They don’t call them pipelines for nothing!

This property also holds when doing analysis and interpretation of variant files (also just a text stream of records often compressed and indexed).

You never change a genotype in a VCF file from one thing to another by hand.

What you are likely to do is extend it with many annotations, analyze its fields to understand their distribution and statistics, and then perform a series of informed filters on many fields and annotations to find those few variants of importance.

But simply using a genomic index into a sorted stream of variant records no longer suffices to perform these tasks.

Thinking at the Speed of 1 Million Variant Histograms
If you think of a set of variants as a table with variants as the rows, you actually do most of your analysis in a column-oriented fashion.

Let’s break down the workflow of analyzing my 1 million variants of the exomes I have for myself, wife and son. I have a VCF file produced by early access to Real Time Genomics’ alignment and variant calling tools, but it could be any exome produced on any pipeline.

The first thing I might do is try to understand how I might define high-quality variants. RTG has a special score called the AVR score they use to incorporate much of their heuristics that can separate high quality from low-quality variants. Here is the histogram of that score from 0 to 0.1 (lower score equates to the lower quality):

AVR score histograms of 1 million variants for myself, wife and son

So I can see the majority of variants fall below the 0.05 threshold (a threshold RTG suggests for distinguishing “high quality” variants) and that a score of 0.02 might remove roughly half the variants.

These types of queries and investigations are desired at each step of the process. I might also look at read depth, genotype quality, allele frequency as annotated by 1000 Genomes, NHBI 6500 exomes, dbSNP and others. Then filter and sort on things like a variant’s protein-coding effect and predicted functional effect and the genotype pattern of my family.

Without getting into the details of each annotation and filtering step, it is clear that we are operating on only a handful of “fields” at a time for each step (for example, the AVR field, the allele frequency field, my family’s genotype fields, etc). We never need to look at the entire bolus of data composing a variant record and all of its annotations for any given operation.

This distinction may not matter when dealing with hundreds or thousands of variants, but to provide a user experience that does not inhibit the creative thinking process while analyzing hundreds of thousands and millions of variants, you must optimize the way you read data.

The goal is to perform an operation like sort a table, apply a filter or show a histogram faster than it takes the user to perform the mental action of choosing their next step: roughly 1.2 seconds. Simple operations like adjusting a filter should actually take less than a mere 0.07 seconds to match the mental speed of a human performing a simple step.

If you can achieve these speeds, then the user will have the experience of creatively flowing through the software at the speed of thought. If not, you risk disrupting the cadence of that flow or worse, causing a dreaded task switch while the user waits.

As an aside: this also can make it difficult for any software not running on the user’s own hardware to achieve these speeds given any action performed through a website incurs the latency of communicating to the server and back through the internet (on average 0.1 to 0.5 seconds).

Columnar Storage: The Old New Thing
The solution to optimized field-based access of tables is simple: store your data field-wise!

Columnar storage databases are nothing new and have a number of very nice properties like the fact that your compression ratios are ridiculously good when all your data of the same field type are stored contiguously. But they are not well suited to transactional workflows that involve adding, editing, and deleting rows. Luckily, as we discussed earlier, we can embrace write-once technologies in data science.

Recently, columnar storage has seen a resurgence of interest on the wave of large-scale distributed data processing popularized by the NoSQL movement.

An example of this is the recent announcement of Amazon RedShift: a cloud-based data analysis engine based on columnar storage designed to scale to petabytes of data.

Open source projects like PyTables have been providing columnar storage backends to Python’s scientific analysis platform for many big data scientific fields, including astrophysics and earth science.

Meanwhile, traditional RDBMS databases are growing the ability to use columnar storage for specific use cases like indexes on tables that don’t change frequently.

In genomics, once the data production portion of generating a variant file is complete, it makes sense to transpose the storage format from record based ones like VCF (and its in-development binary sibling BCF) to a columnar one.

May I Present: Tabular Storage Format (TSF1)

Columnar storage is also nothing new to us at Golden Helix. Since I started working on our code base 11 years ago (yes, you read that right), we have been storing data as a series of arrays ready to be read directly into the algorithms that powered our software. In fact, our first file format was not much more than that!

Believe it or not, we have gone through 3 file formats (with multiple versions) since then, and I’ve had a chance to learn about their strengths and weaknesses as our software adapted to the changing landscape of genomic analysis.

So from all that learning, and from thoughts no doubt inspired by Heng Li’s successful BAM and Tabix, there are three things that have motivated me to create a 4th file format:

Genomic data has a mix of attribute fields (like a variant’s RSID) and matrix fields that have an entity dimension (like Genotypes and Read Depths). It’s useful to keep those matrices aligned and in sync with a view of a source.
Columns are best stored in the longer locus dimension (i.e. one column for each sample or attribute of variants). But because this dimension can be millions of elements long, you need to segment your columns into chunks so you can read a given record efficiently.
Compression of your data in the file format always makes sense as long as you can work with blocks of data large enough to compress. (see Why Modern CPUs are Starving)

The first point is best illustrated by thinking about my RTG VCF file with my 3 samples in it.

The simplified view of my VCF file with 3 samples.

I’m calling fields like RSID and Ref Alleles Locus Attribute fields and Genotypes and AVR Scores Matrix fields (there is actually also one Entity Attribute field not shown. Can you guess what it is?). If I were to filter to a subset of variants, I will still want at hand all the matrix fields for those variants. If one were to flatten this to a single table (like our current file format does), you have to choose how to represent these matrices (collated by sample, or joined end-to-end, etc.).

Diagram of columnar storage broken up into regularly sized chunks

The second point about segmenting the columns is an implementation detail that results in a couple of important properties: namely record based access can be just as fast (if not faster) and you end up with units of data for storage that lend themselves to compression.

For most types of data, zlib compression will result in very good compression ratios. But with columnar data, you can take advantage of the uniform nature of the elements in your data to do things like shuffle the bits before compression. The Blosc meta-compressor can take information about the type-size of its data and considerably outperform zlib on things like columns of sorted numbers or uniformly sized strings (i.e. indexes).

With a tabular based storage system that has excellent record-wise lookups and the ability to have extendable indexing for genomic queries (like tabix), we can actually outperform the special purpose file formats we have used in the past in nearly every possible dimension (space, speed, usage modes).

Format	Uses	Storage	Read Modes	Write Modes
DSF3	Powers our spreadsheets. Has no concept of what is being stored in each axis.	Whole typed single columns	Column-only (row-wise read requires holding entire dataset in memory)	Append column only (appending rows requires rewriting entire dataset)
IDF2	Powers our annotation tracks (currently).	Records of strongly typed fields with a genomic coordinate indexed.	Genomic query resulting in records only. No tabular access.	Append records only (appending a field requires rewriting entire dataset)
DSM	Provides genomic annotation to our spreadsheets (marker maps).	Records of strongly typed fields with caches of genomic coordinates in columnar arrays.	Table indexes for records, but records guaranteed to be in genomic coordinate order.	Append records only. Append fields possible but results in a dataset rewrite.
TSF1	Annotation track (next release) and all usages in the future.	Chunked columnar typed fields and matrices.	Table index, field or record-wise. Genomic query with index computed.	Append fields. Append records requires only mutating the last chunk for each field.

How Tabular Storage Format (TSF) compares to our current special case storage formats.

Armchair Diagnosis: Sir, You Have Not-Invented-Here Syndrome
It may seem strange to you that we have been inventing our own file formats for the past 15 years of Golden Helix and now I’m doing it again.

I get it. Heck, I agree with you! I have been spending the last two years scouring Hacker News, proggit, twitter, and many great technical manuals, discussion groups and papers (and another) looking for an off-the-shelf solution that fits the bill for building a local (embedded) high-performance analysis software with big-data science data sets.

I’ve looked at MonetDb (columnar store client/server db), LevelDB (scalable key-value store), Kyoto Cabinet (another key-value store), PyTables (native python array storage on HDF5), raw HDF5, SciDb (massively parallel distributed array-based computation), and any number of server-oriented solutions like Oracle Exadata, SQL Server 2012 and MySQL’s ICE.

The closest thing to winning is PyTables (which I studied closely), but it falls short by nature of its Python implementation not allowing it to be run in background threads. For a file format that needs to power our massively multi-threaded GenomeBrowse visualization, this is a show stopper.

HDF5 is the runner-up, with its ability to store strongly typed arrays in any number of dimensions and configurations. It is in fact heavily used in our industry. Affymetrix uses it in their chromosomal microarray analysis and PacBio stores their primary sequence data output in it. But HDF5 is really just a container format on top of which you would have to build a chunk-columnar store with genomic indexing.

We have been happily using SQLite as our containment format since 2005 (all the formats listed above use SQLite). SQLite is itself a powerful embedded database engine and is the most mature piece of technology I have ever had the pleasure of working with.

SQLite is an amazingly good (and recommended) choice for an application file format. It supports multi-threaded and multi-process concurrent access with all the difficulties of different locking behaviors worked out across platforms. It has useful constructs like indexes and page caching all figured out. And it is very darn efficient in its use of space.

Most importantly, it is entirely based on a single file. Yes, one of those things you can copy, move, redistribute, archive, or even email. And it contains all your meta-data, index data, columnar data and potentially multiple sources of data.

All in a file.

Flat Files on a Desktop? But, but Dude… the Cloud!
I’m a big fan of the file.

It takes zero IT intervention to use at your own will. It can be on your Mac laptop as easily as your Windows desktop. It can be shared through Dropbox, FTP, HTTP and placed on a network share.

As a company, we have built our livelihood around the idea of taking best practice methods in genomics (often available in free tools like R and PLINK) and empowering people to do analysis they couldn’t easily do themselves.

How empowering would it be to require every user of our software to have a big iron database server to handle the data storage?

How empowering would it be if storage and analysis was done on rented computers accessed over the internet – a solution requiring a business model that cannot provide a fixed cost for your analysis to put in your budget and adds a recurring charge for you data just existing?

I’m a big fan of cloud services. I use Evernote addictively. Our own community’s web-based resources like NCBI dbSNP and the UCSC Gnome Browser are foundational to the field. I even understand why a data production pipeline may benefit from the elastic scaling provided by the cloud.

But once you stretch the imagination to see a world where you can do your exome and whole genome analysis on your own device, at the speed of thought and not need more storage than the size of the compressed data you started with, it’s hard to see how there is anything more empowering than that.

TSF took its first baby steps at Zocalo Coffee on Bozeman’s Main Street

Gun Slinging Cowboy Coding
So I’ve told you about the major points of TSF. But the devil is always in the details. This thing is going to take many man-years to see the light of day, right? Nope. It’s done.

And even cooler: it took only a week to write!

Well, I suppose I should be accurate and explain that it took two years of planning and preparation, many man-years of supporting infrastructure and framework first… and then a week to write. Well, and then another week to optimize and improve the genomic index, etc.

But it’s done. And… drumroll… it totally kicks ass.

Here is the size of some converted files compared to other formats:

What	TSF	IDF2	DSF3	Native
dbSNP 132	430MB	3.9GB	NA	822MB txt.gz, 6.4GB txt
dbSNP 137	700MB	6.3GB	NA	1.5GB .txt.gz, 10GB txt
dbNSFP v2.0	1.1GB	8.9GB	NA	2.6GB zip, 26.9GB txt
Trio RTG VCF	62MB	NA	726MB	60.4MB vcf.gz+tbi, 225MB vcf
My Exome GATK VCF	13MB	NA	50MB	17MB vcf.gz+tbi, 83MB vcf

TSF often beats the compressed size of native text representations and has the genomic index built-in.

When will the world see it?
The first of TSF’s assignments will be to completely replace IDF2 in our upcoming Golden Helix SVS 8.1 release. Most people may only notice the 80% or more reduction in file size of the annotation files they need to annotate and filter their data, but soon we will start taking advantage of the quick tabular access, easy joining, filtering and subsetting it is capable of providing. But more on that at a later time.

For now, I’m looking forward to some quieter productive hours to build out converters from all the many, many file formats in genomics to our new workhorse file format.

16 thoughts on “All I Want for Christmas Is a New File Format for Genomics”

Hyunmin Kim December 18, 2013 at 9:37 AM

Thanks for your valuable thread. I think the one of the important thing is who first and how many used in field? Although needed more time and more cost, most researcher may be get the nice result for them. sorry for my eng.
Happy christmas!

Reply ↓
1. Gabe Rudy December 18, 2013 at 10:04 AM
  
  Hey Hyunmin,
  
  I think what your are saying is what matters more than the technical aspects of a tool is how useful it is. I totally agree. Researchers are a scrappy and resourceful bunch and can get things done.
  
  We will be using TSF as a piece of backend technology to build tools to help people get things done as well. And hopefully allow more things to be easier and even feasible than before.
  
  Reply ↓
Naveen Michaud-Agrawal February 1, 2014 at 10:04 PM

Hi Gabe,
What do you use as the underlying storage implementation of TSF1? HDF5?

Also, have you seen ADAM (https://github.com/bigdatagenomics/adam), a recent project from Berkeley’s AMP lab? They followed a similar evolution in file design, finally settling on a hierarchical column storage model (Parquet) that was first introduced by Google (http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf). As a result of their file format design, they are able to leverage a lot of the current big-data ecosystem (Hadoop MR, Spark, Shark, etc) for large scale genomics processing. Anyway, you might interested in how their data format differs from TSF1.

Naveen Michaud-Agrawal

Reply ↓
1. Gabe Rudy February 5, 2014 at 12:33 PM
  
  Hey Naveen,
  
  Thanks for the question. I’m still using SQLite as my storage container of the TSF1 data. Nothing beats it in terms of maturity, introspection and fine-control of performance such as cacheing behavior and multi-threaded access.
  
  Actually, one of the developers and researchers of ADAM contacted me and I talked to him in detaila bout hte project. It looks like a great approach, and I was excited to learn about Parquet (which Twitter built I believe).
  
  My focus in terms of tools is entirely user-driven applications that generally run on single-computers, so our goals are similar but not identical. I think it great to see very similar engineering choices being made though, as it validates that we are both seeing the core issues and choices in a similar way and reaching the same conclusions.
  
  Reply ↓
Claudio March 25, 2014 at 11:04 AM

Hello Gabe,
thank you for this instructive post. I’m a parallel software developer starting a project on genomics file storage and compression. I’m currently trying to understand the state of the art on open file formats. Could you please point me to any relevant source?
Thanks!
Kind regards,
Claudio

Reply ↓
1. Gabe Rudy March 25, 2014 at 4:27 PM
  
  Hey Claudio,
  
  Thanks for the feedback. Other than the format’s I mentioned in my post (HDF5, SQlite, PyTables), there is Parquet (the column storage format for Hadoop used by the ADAM folks from the above comments) and of course the internal format of the representational databases like Postgres (check out their jsonb format for example).
  
  In the genomic industry, there are other folks trying to make use-specific binary formats such as bcf (Binary VCF).
  
  Hope that get’s you started!
  
  Reply ↓
Phil Appleby April 16, 2014 at 10:21 AM

Hi Gabe,

Very interesting article thanks!

I have been tasked with creating a searchable / updatable datastore for genomic data (millions of SNPS and thousands of samples) and have been having a look at MongoDB (loading seems a little slow).

SQLite worries me as a solution for when the data becomes very large, though.

Thanks
Phil Appleby

Reply ↓
1. Gabe Rudy April 21, 2014 at 12:11 PM
  
  Hey Phil,
  
  Certainly the right tool for the job is my moto as well. Any time you have a data store that requires updating, then the approach I took might not make sense. That said, I have been looking at the idea of a “variant archive/search/distribution server” and the only thing that really needs to mutate from one batch of data to the next is the master index of all variants ever seen and links out to their individual tables.
  
  It’s not like you change the actual sample-variant data itself since that was machine generated.
  
  SQlite scales amazingly well in my experience. I’ve created single-file databases in the hundreds of gigabytes range and for read-only applications they have as good performance as reading from a more fully featured DB backend.
  
  Reply ↓
  1. Phil Appleby May 5, 2014 at 6:50 AM
    
    Hi Gabe,
    
    Thanks for your reply I won’t dismiss SQLite then.
    
    I think more than anything it’s time to experiment!
    
    Phil
    
    Reply ↓
  2. Phil Appleby May 13, 2014 at 9:35 AM
    
    Hi Gabe,
    
    I have now tried building a simple sqlite3 database using an exome .vcf file from an Affymetrix page (dimensions are 300000 markers and 1250 samples) and that results in a ~25G file which works well for point queries.
    
    If I can find the disk space I’m going to try for a 10,000,000 marker example in which I’ll try a crude “sharding” scheme (one db file per chromosome).
    
    Do you have any insight as to the best indexing strategy for sqlite? (Build tables with / without indices and then add indices later if required?).
    
    Phil
    
    Reply ↓
jua September 19, 2014 at 4:47 PM

Hello.

Please, could you tell a little bit more about the performance of MonetDB vs SciDB vs some other (such as Aerospike or Redis)

Reply ↓
Pingback: VarSeq: Making Variant Discovery and Gene Panels Easy | Our 2 SNPs…®
Pingback: In Pursuit of Longevity: Analyzing the Supercentenarian Whole Genomes with VarSeq | Our 2 SNPs…®
kortschak January 24, 2015 at 6:18 PM

Anything more concrete on this wrt a format spec?

Reply ↓
justin Paschall October 10, 2016 at 8:42 PM

this is a great post Gabe, still so relevant. Any chance of these formats being open sourced?

Reply ↓
1. Gabe Rudy Post authorDecember 1, 2016 at 11:26 AM
  
  Not at the moment as the write portion of it is very integrated into our code-base and other third party libraries.
  
  We actually have a C based reader for the format here: https://github.com/goldenhelix/tsf-c
  
  This is used by VSWarehouse backend to make TSF files readable through a custom Postgres FDW backend, which works fantastic!
  
  This file format has really shown to scale in this context.
  
  Reply ↓

@gabeinformatics

an "The Golden Helix Blog" OUR 2 SNPS…

All I Want for Christmas Is a New File Format for Genomics

16 thoughts on “All I Want for Christmas Is a New File Format for Genomics”

Leave a Reply Cancel reply