Secondary Analysis 2.0 – Part III

Detection of CNVs in NGS Data

Our Secondary Analysis 2.0 blog series continues with Part III: Detection of CNVs in NGS Data. We will give you an overview of some design principles of a CNV analytics framework for next-gen sequencing data. There are a number of different approaches to CNV detection. The published algorithms share common strategies to solve the underlying computational problems. In principle, CNV detection methods incorporate three major steps:

  1. Data preprocessing: correct for biases in the data and establish a baseline for detecting variation.
  2. Copy-number state assignment: assign a copy-number state to each target.
  3. Large-event calling: define the boundaries of multi-target events using a segmentation algorithm.
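
As a preview of step 3, segmentation can be reduced, in its simplest form, to merging runs of adjacent targets that share a non-diploid copy-number state; production algorithms such as circular binary segmentation are considerably more sophisticated. The helper below is an illustrative sketch (the name `segment_states` and the diploid-state convention `2` are assumptions, not taken from any published tool):

```python
def segment_states(states, normal_state=2):
    """Collapse per-target integer copy-number states into
    (start, end, state) events, with end exclusive.
    Runs at the normal (diploid) state are skipped."""
    events = []
    i, n = 0, len(states)
    while i < n:
        if states[i] == normal_state:
            i += 1
            continue
        # extend the run while the state stays the same
        j = i
        while j < n and states[j] == states[i]:
            j += 1
        events.append((i, j, states[i]))
        i = j
    return events

# Example: a two-target deletion followed by a three-target duplication
# segment_states([2, 2, 1, 1, 2, 3, 3, 3]) -> [(2, 4, 1), (5, 8, 3)]
```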

Preprocessing

In this phase, algorithms correct for systematic biases and normalize the data, establishing a baseline for detecting variation. The two most common corrections for systematic bias address GC-content and mappability; the two most common normalization methods are Principal Component Analysis (PCA) and normalization relative to reference samples. One source of bias in the coverage data is GC-content bias: it is known that regions with very high or very low GC-content tend to have lower mean read depth, because PCR amplifies such regions less efficiently. When correcting for GC bias, CNV calling algorithms generally either filter out regions with extreme GC-content or normalize the depth to account for the bias. Algorithms that use the filtering approach include XHMM and OncoSNP-SEQ, while algorithms that normalize for GC-content include CLAMMS, ReadDepth, Patchwork, and Control-FREEC.
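
One common recipe for GC normalization bins targets by GC fraction and rescales each target's depth by the median depth of its bin, while masking targets with extreme GC-content (mirroring the filtering approach). The function below is an illustrative sketch, not taken from any of the tools named above; `gc_correct`, the bin count, and the 20%–80% cutoffs are assumed parameters:

```python
import numpy as np

def gc_correct(depth, gc, n_bins=20, extreme=(0.2, 0.8)):
    """Median-based GC correction: scale each target's depth by the
    ratio of the global median depth to its GC bin's median depth.
    Targets with extreme GC fraction are masked (set to NaN)."""
    depth = np.asarray(depth, dtype=float)
    gc = np.asarray(gc, dtype=float)
    corrected = np.full_like(depth, np.nan)
    keep = (gc >= extreme[0]) & (gc <= extreme[1])
    global_median = np.median(depth[keep])
    # assign each target to a GC bin of width 1/n_bins
    bins = np.minimum((gc * n_bins).astype(int), n_bins - 1)
    for b in np.unique(bins[keep]):
        in_bin = keep & (bins == b)
        bin_median = np.median(depth[in_bin])
        if bin_median > 0:
            corrected[in_bin] = depth[in_bin] * (global_median / bin_median)
    return corrected
```

After correction, targets drawn from different GC bins become directly comparable: a bin whose depth was systematically depressed is scaled back up toward the global median.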

Another source of bias in the coverage data is mappability bias. The mappability of a region is the probability that a read originating from it maps unambiguously back to it. Regions with low mappability produce more ambiguously mapped reads, which can cause errors in CNV detection. Generally, algorithms address mappability bias by filtering out low-mappability regions; methods that take this approach include CODEX, Control-FREEC, and OncoSNP-SEQ.
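
The filtering step itself is simple to express. The helper below is an illustrative sketch (the name `filter_by_mappability` and the 0.9 cutoff are assumptions; each tool chooses its own threshold), given a precomputed mappability score per target:

```python
def filter_by_mappability(targets, mappability, min_map=0.9):
    """Drop targets whose mappability score (the fraction of reads
    from the region expected to map uniquely) falls below min_map."""
    return [t for t, m in zip(targets, mappability) if m >= min_map]

# Example: the middle target sits in a repeat-rich region
# filter_by_mappability(["exon1", "exon2", "exon3"],
#                       [0.99, 0.42, 0.95]) -> ["exon1", "exon3"]
```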

Several CNV detection algorithms perform their primary normalization via Principal Component Analysis (PCA) on the coverage data. PCA uses an orthogonal transformation to convert a set of observations into a set of linearly uncorrelated variables called principal components. The CoNIFER and XHMM algorithms normalize by removing the k strongest principal components, which tend to capture batch effects and other systematic noise. As an alternative to PCA, it is also possible to normalize against a set of reference samples, using deviation from the average coverage in those samples as an indicator of CNV occurrence. Generally, this is done by computing evidence metrics, such as a Z-score, relative to the control samples. This approach normalizes out biases shared across the reference samples, thereby reducing or eliminating the need to explicitly correct for systematic biases such as GC-content and mappability. Algorithms that rely on reference samples for CNV detection include CoNVaDING, VisCap, CLAMMS, and CNVkit.
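
Both strategies can be sketched in a few lines, assuming a samples-by-targets coverage matrix. The first function removes the k strongest principal components via SVD, in the spirit of XHMM and CoNIFER; the second computes per-target Z-scores against a panel of reference samples. Function names are illustrative, not drawn from any tool's API:

```python
import numpy as np

def remove_top_components(X, k):
    """Remove the k strongest principal components from a
    samples-by-targets coverage matrix and return the residual."""
    Xc = X - X.mean(axis=0)               # center each target
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    s[:k] = 0.0                           # zero out the top-k components
    return U @ np.diag(s) @ Vt

def reference_zscores(sample_depth, ref_depths):
    """Z-score of one test sample's per-target depth relative to a
    reference panel (rows = reference samples, columns = targets)."""
    mu = ref_depths.mean(axis=0)
    sd = ref_depths.std(axis=0, ddof=1)
    return (sample_depth - mu) / sd
```

A strongly negative Z-score at a run of adjacent targets is then evidence for a deletion, and a strongly positive one for a duplication; the segmentation step turns those runs into called events.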

While PCA-based normalization has the advantage of handling varied and even unknown sources of noise in the data, the approach has two major disadvantages compared to reference-sample normalization in a clinical setting. First, it requires significantly more samples to provide robust results; clinical labs may have as few as 15-20 samples as they validate and configure a test, whereas reference-sample normalization can provide reasonable results with far fewer samples. Second, the choice of k, the number of strongest principal components to factor out of the data, is a somewhat subjective parameter, yet highly influential on the final result. For clinical validation of bioinformatics methods in a genetic test, algorithms should be robust, meaning that small changes in inputs do not lead to dramatically different results. Additionally, they need to be as transparent as possible in regards to the… To continue reading, I invite you to download a complimentary copy of my eBook by clicking the button below.



About Andreas Scherer

Dr. Andreas Scherer is CEO of Golden Helix. The company has been delivering industry-leading bioinformatics solutions for the advancement of life science research and translational medicine for over a decade. Its innovative technologies and analytic services empower scientists and healthcare professionals at all levels to derive meaning from the rapidly increasing volumes of genomic data produced by next-generation sequencing. With its solutions, hundreds of the world's hospitals and testing labs are able to harness the full potential of genomics to identify the cause of disease, develop genomic diagnostics, and advance the quest for personalized medicine. Golden Helix products and services have been cited in thousands of peer-reviewed publications, and the company is on the Inc. 5000 list of the fastest-growing private companies in the US. Dr. Scherer is also Managing Partner of Salto Partners, Inc., a management consulting firm headquartered in Nevada. He has extensive experience successfully managing growth as well as orchestrating complex turnaround situations; Salto Partners advises clients in the high-tech and life sciences space on business strategy, financing, sales, and operations. Dr. Scherer holds a Ph.D. in computer science from the University of Hagen, Germany, and a Master of Computer Science from the University of Dortmund, Germany. He is author and co-author of over 20 international publications and has written books on project management, the Internet, and artificial intelligence. His latest book, "Be Fast Or Be Gone", is a prizewinner in the 2012 Eric Hoffer Book Awards competition and was named a finalist in the 2012 Next Generation Indie Book Awards.
