Detection of CNVs from NGS Data - Secondary Analysis 2.0 Series

Detection of CNVs in NGS Data

Our Secondary Analysis 2.0 blog series continues with Part III: Detection of CNVs in NGS Data. This post gives an overview of some design principles behind a framework for detecting copy number variants (CNVs) in next-generation sequencing (NGS) data. CNV detection is one of the more nuanced challenges in NGS data analysis, and there are a number of different approaches to it. The published algorithms nevertheless share common strategies for solving the underlying computational problems. In principle, CNV detection methods incorporate three major steps:

Preprocessing the data to correct for biases and create a baseline for detecting variation.
Assigning copy-number states.
Calling large events by defining the boundaries of multi-target events with a segmentation algorithm.

Preprocessing

In this phase, the algorithms correct for systematic biases and normalize the data to establish a baseline for detecting variation. The two most common bias corrections address GC-content and mappability. The most common normalization methods are Principal Component Analysis (PCA) and normalization relative to reference samples.

One source of bias in the coverage data is GC-content bias. Regions with very high or very low GC-content tend to have lower mean read depth because they amplify less efficiently during PCR. To correct for GC bias, CNV calling algorithms generally either filter out regions with extreme GC-content or normalize the coverage to account for the bias. Algorithms that use the filtering approach include XHMM and OncoSNP-SEQ, while algorithms that normalize for GC-content include CLAMMS, ReadDepth, Patchwork and Control-FREEC.
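To make the two GC strategies concrete, here is a minimal sketch that first filters out targets with extreme GC-content and then rescales the remaining targets by the median depth of their GC bin. It is not the method of any of the tools named above. The function name gc_correct, the 0.3 to 0.7 GC window and the 0.05 bin width are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from statistics import median


def gc_correct(depths, gc_fractions, gc_min=0.3, gc_max=0.7, bin_width=0.05):
    """Return {target_index: corrected_depth} for targets that pass the GC filter.

    depths and gc_fractions are parallel lists with one entry per target.
    The thresholds and bin width are illustrative assumptions.
    """
    # Filtering approach: drop targets whose GC fraction is extreme.
    kept = [i for i, gc in enumerate(gc_fractions) if gc_min <= gc <= gc_max]
    if not kept:
        return {}

    # Normalization approach: group the remaining targets into GC bins.
    bins = defaultdict(list)
    for i in kept:
        bins[int(gc_fractions[i] / bin_width)].append(i)

    # Rescale each target so its bin's median depth matches the overall median.
    overall = median(depths[i] for i in kept)
    corrected = {}
    for members in bins.values():
        bin_median = median(depths[i] for i in members)
        for i in members:
            # A bin with zero median depth cannot be rescaled; report 0.0.
            corrected[i] = depths[i] * overall / bin_median if bin_median > 0 else 0.0
    return corrected
```

After this correction, two targets that differ only in their GC bin end up on a comparable depth scale, which is what the later copy-number state assignment relies on.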

Another source of bias in the coverage data is mappability bias. The mappability of a given region is the probability that a read originating from the region is unambiguously mapped back to it. Regions with low mappability tend to produce more ambiguously mapped reads, which can cause errors in CNV detection. Generally, algorithms address mappability bias by filtering out low-mappability regions. Methods that handle mappability bias in this way include CODEX, Control-FREEC and OncoSNP-SEQ.
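The filtering step described above can be sketched in a few lines. This example assumes each target already carries a precomputed mappability score between 0 and 1. The "mappability" key, the function name and the 0.9 cutoff are hypothetical choices for illustration, not settings taken from CODEX, Control-FREEC or OncoSNP-SEQ.

```python
def filter_by_mappability(targets, min_mappability=0.9):
    """Split targets into (kept, dropped) using a mappability cutoff.

    Each target is a dict with a hypothetical "mappability" key in [0, 1].
    The default cutoff is an arbitrary illustrative value.
    """
    kept, dropped = [], []
    for target in targets:
        if target["mappability"] >= min_mappability:
            kept.append(target)
        else:
            dropped.append(target)
    return kept, dropped
```

Returning the dropped targets alongside the kept ones makes it easy to report which regions were excluded from CNV calling.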

Several CNV detection algorithms perform their primary normalization via Principal Component Analysis (PCA) on the coverage data. PCA uses an orthogonal transformation to convert a set of observations into a set of linearly uncorrelated variables called principal components. The CONIFER and XHMM algorithms normalize using PCA by removing the k strongest principal components, which are assumed to capture systematic noise rather than true copy-number signal.

As an alternative to PCA, it is also possible to normalize using a set of reference samples. Here, deviation from the average coverage in the reference samples serves as an indicator of a CNV. Generally, this is done by computing evidence metrics, such as a Z-score, relative to the reference samples. This approach normalizes out biases that are present across the reference samples, thereby reducing or eliminating the need to explicitly correct for systematic biases such as GC-content and mappability. Algorithms that rely on reference samples for CNV detection include CoNVaDING, VisCap, CLAMMS and CNVkit.
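As a rough illustration of the reference-sample approach, the sketch below computes a per-target Z-score for one sample against a set of reference samples. It assumes the depths have already been scaled so that samples are comparable, for example by each sample's total coverage. The function name and the input layout are hypothetical and do not reflect how CoNVaDING, VisCap, CLAMMS or CNVkit implement this.

```python
from statistics import mean, stdev


def reference_z_scores(sample_depths, reference_depths):
    """Z-score each target's depth in the sample against the reference samples.

    sample_depths: list of normalized depths, one per target.
    reference_depths: list of lists, one per reference sample, in the same target order.
    """
    if len(reference_depths) < 2:
        raise ValueError("need at least two reference samples to estimate spread")

    z_scores = []
    for t, depth in enumerate(sample_depths):
        ref = [sample[t] for sample in reference_depths]
        mu, sigma = mean(ref), stdev(ref)
        # With zero spread there is no basis for a Z-score, so report None.
        z_scores.append((depth - mu) / sigma if sigma > 0 else None)
    return z_scores
```

Under these assumptions, a strongly negative Z-score at a target is evidence for a deletion and a strongly positive one is evidence for a duplication. Any bias shared by the sample and the references largely cancels out in the comparison.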

While PCA-based normalization has the advantage of handling varied and even unknown sources of noise in the data, it has two major disadvantages compared to reference sample normalization in a clinical setting. First, it requires significantly more samples to provide robust results. Clinical labs may have as few as 15-20 samples as they validate and configure a test. Reference sample-based normalization can provide reasonable results with far fewer samples. Second, the number k of strongest principal components to factor out of the data is a somewhat subjective parameter, yet it is highly influential on the final result. In clinical bioinformatics, algorithms used for genetic test validation should be robust, meaning that small changes in inputs should not lead to dramatically different results. Additionally, they need to be as transparent as possible with regard to the… To continue reading, I invite you to download a complimentary copy of my eBook.