After my latest blog post, Jeffrey Rosenfeld reached out to me. Jeff is a member of the Analysis Group of the 1000 Genomes Project and offered some fascinating insights that he gave me permission to share here:
I saw your great blog post about the problems in the lack of overlap between Complete Genomics and 1000 Genomes data. I just had a paper published that addresses these same issues that I think you and the Golden Helix team would find interesting:
Thanks for the note, and I’m glad you found my post interesting.
I actually read your paper already as it has been consistently echoing through the twitterverse recently.
Great read! I definitely found the variant set comparison between 1kG and CGI very interesting.
Regarding the over-filtering 1kG did in their recent release, I wish I could get to the bottom of which filter culled these variants. I imagine it has something to do with being close to an indel or in some region with low mappability or something.
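To make the kind of call-set comparison we're talking about concrete, here is a minimal sketch (not the actual 1kG or CGI pipeline, and using made-up toy variants) of comparing two variant lists keyed by chromosome, position, ref, and alt to find what one set has that the other filtered out:

```python
# Toy sketch: compare two variant call sets keyed by (chrom, pos, ref, alt)
# to find shared variants and variants unique to each caller.

def variant_key(line):
    """Parse a minimal VCF-style line into a hashable variant key."""
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    return (chrom, int(pos), ref, alt)

def compare_call_sets(vcf_a_lines, vcf_b_lines):
    """Return (shared, only_in_a, only_in_b) as sets of variant keys."""
    a = {variant_key(l) for l in vcf_a_lines if not l.startswith("#")}
    b = {variant_key(l) for l in vcf_b_lines if not l.startswith("#")}
    return a & b, a - b, b - a

if __name__ == "__main__":
    # Hypothetical call sets for illustration only
    kg_calls  = ["1\t1000\t.\tA\tG", "1\t2000\t.\tC\tT"]
    cgi_calls = ["1\t1000\t.\tA\tG", "1\t3000\t.\tG\tA"]
    shared, only_kg, only_cgi = compare_call_sets(kg_calls, cgi_calls)
    print(len(shared), len(only_kg), len(only_cgi))  # 1 1 1
```

The interesting follow-up, of course, is annotating the discordant keys against region tracks (mappability, repeats) to see whether the discordance clusters there.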
I understand why they might try to filter based on such a list of “trouble” regions.
I have been curious whether some metric such as mappability, RepeatMasker regions, or large segmental duplications correlates strongly with the regions enriched for systematic differences between CGI and 1kG, and would therefore be good at detecting problematic regions.
I would find such a validated tool to score or flag problematic regions very useful in analysis.
One approach I think looks very promising was in a recent publication called “Genomic Dark Matter,” where Hayan Lee simulates short reads from the reference genome to mimic a given sequencing protocol or platform (like Illumina 100bp-PE) and then checks how accurately individual bases are covered by the aligned simulated reads.
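The core idea can be sketched in a few lines. This is only a hedged illustration of the simulation concept, not the published algorithm: real tools simulate reads with error models and run a full aligner, whereas here I use error-free reads and exact k-mer matching on a tiny made-up reference, scoring each position by how uniquely its simulated read places back:

```python
# Sketch of reference-based read simulation for mappability:
# simulate an error-free read of fixed length from every position,
# then score each position by 1 / (number of exact placements).
# A score of 1.0 means uniquely mappable at that read length.

from collections import Counter

def mappability(reference, read_len):
    """Per-start-position mappability scores for simulated reads."""
    reads = [reference[i:i + read_len]
             for i in range(len(reference) - read_len + 1)]
    counts = Counter(reads)           # how often each read occurs
    return [1.0 / counts[r] for r in reads]

ref = "ACGTACGTTTGCA"                 # toy reference with an internal repeat
print(mappability(ref, 4))
# Positions 0 and 4 share the repeated 4-mer "ACGT", so they score 0.5
```

Sweeping `read_len` over this function is exactly the kind of experiment that tells you what read length a given genome (or region) needs to become mostly uniquely mappable.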
Anyway, these are just thoughts I’ve been having on finding ways to make these types of discrepancies seem more manageable and less scary, but they might not have an easy answer 🙂
Glad you enjoyed my paper. I am actually part of the 1KGP and it is a very interesting process seeing how they try to deal with things. Their conference calls are very interesting since I get to hear all of these topics discussed. Here is a synopsis of why I think the variant lists show up the way they do:
1. Calling variants from low-coverage (4x) sequencing is very hard and leads to tremendous variation among callers. The 1KGP has tried many different callers from different groups, and their results tend to vary more than one would want.
2. Up until now, SNPs, indels and SVs are called separately and then integrated afterwards. This is admittedly not the ideal way to call the variants, but currently, there are specialized callers for each type of variant and there is no caller that can accurately call SNPs, indels, MNPs and structural variants all at the same time.
3. The project is more concerned about trying to reduce false positives than false negatives. Due to the error in the sequencing data there could be evidence of a huge amount of variants, and they want to avoid overcalling.
The project is definitely working on determining which regions are callable and not callable, but that work is still not ready for release. One current approach is to annotate highly repetitive regions in order to catch reads that map to a very large number of places in the genome and confuse the analysis.
Currently, they are working on phase 2 which involves the use of callers such as freebayes, cortex and a new version of GATK which do either haplotype-based calling or something akin to de novo assembly. I think that these callers will do a better job, but we are still comparing and assessing them.
I saw that paper from Hayan and I have had a few conversations with her and Mike Schatz about using their algorithm to produce metrics as to what read length is needed for different projects and whether paired-end is valuable. Illumina would always like people to do the longest paired-end reads possible so they make the most on reagents, but there are definitely cases where short reads would be sufficient (or 95% as good) which would be much cheaper.
Thanks for this insight.
I assumed they would want to lean towards reducing false positives over false negatives, but it’s good to know that more definitively.
It’s always great to hear that new and better things are coming down the pipe!
I’ve always been impressed, though, by how Complete Genomics seems to have led the field in many of these informatics techniques (local indel re-alignment, haplotype/phased calling, etc.).
re: Hayan paper: That’s true, you could use those read simulations and mappability scores to get a good idea of the most cost-effective protocol for a given application. Although paired-end protocols can give evidence for structural variants, I suppose you’re probably right that single-end and shorter reads at the same or higher coverage give you good small indel and SNV calls for certain applications at a more attractive cost.