Please see our updated tutorials for CNV Analysis by visiting: http://goldenhelix.com/resources/SNP_Variation/tutorials/index.html
The following statement is representative of a common question that is posed to the Golden Helix support team:
“I followed all of the steps in the SVS7 CNV analysis tutorial, but my results seem kinda funny. The segment means are skewed to the left and it doesn’t look like I have any copy number gains. Can you tell me what to do? By the way, I’m using Illumina data.”
Let’s face it: CNV analysis is not easy. While SNP-based GWAS follows a largely standardized process, there are few conventions for CNV analysis. CNV studies typically require extensive craftsmanship to fully understand the contents of the data and explain the findings. As much as we would like to automate the process, there are always factors that require the researcher to make informed decisions or educated guesses about how to proceed. The Golden Helix website features a CNV analysis tutorial, but it is only a high-level introduction to the process. There are numerous variables that may affect the analysis workflow, and the platform used to generate the data is prominent among those variables. The tutorial is built around data from an older Affymetrix platform, and the analysis steps described there are generally applicable to most platforms. However, Illumina data has some unique features that require special attention.
Figure 1: Histogram of segment means, as seen in the SVS Plot Viewer, for a small study based on the Affymetrix 500K platform. Segmentation was performed with the SVS CNAM univariate method. Copy-neutral segments cluster near zero, while clusters of apparent copy number gains and losses are seen above and below zero, respectively.
The question above points out one of the most distinctive features of copy number data generated from an Illumina platform and processed with GenomeStudio. The workflow in the tutorial suggests that after univariate segmentation, a histogram of the segment means (the average value of the observed log ratios within the identified segments) should form 3 or more clusters that correspond to distinct copy number values. Figures 1 and 2 show this feature. Figure 1 is taken directly from the current CNV tutorial (a new tutorial will be coming out soon that demonstrates new features in version 7.4), and shows a large number of segments with means near zero (corresponding to a baseline copy number of 2), with smaller clusters of segments above and below zero that correspond to copy gains and losses, respectively. Figure 2 is a detail view of a similar histogram from a large study using the Affymetrix 6.0 array, where there is evidence for 5 copy number classes. Figures 3 and 4 are normal examples from Illumina studies, where the segment mean distribution looks very different. Figure 3 is from a fairly small study using the 1M-Duo array, and Figure 4 is from a larger study using the Omni1-Quad. In both of the Illumina examples, the majority of segments have a mean value near 0, as expected for copy-neutral regions. The variable regions, however, are skewed to the left rather than being balanced to the left and right as in the Affy examples. You will notice a cluster of values just left of zero that likely represents heterozygous deletions, and a very broad distribution of apparent homozygous deletions further to the left. I have been told by a customer who is using a custom iSelect array that they observed a similar pattern in his data.
Figure 2: Histogram of segment means from a large study using the Affymetrix SNP Array 6.0 platform. The clustering of segment means suggests 5 different copy number classes. Segmentation was performed with the SVS CNAM univariate method.
Figure 3: Histogram of segment means for a small study using the Illumina 1M-Duo platform. Segmentation was performed with the SVS CNAM univariate method.
Figure 4: Histogram of segment means for a larger study using the Illumina Omni1-Quad platform. Segmentation was performed with the SVS CNAM univariate method.
Reviewing the Illumina data from the perspective of the tutorial, or the perspective of somebody more familiar with data from Affy or aCGH, leads to several questions: Why are the homozygous deletions skewed so far to the left? Where are the copy gains? Why aren’t there more heterozygous deletions in the Omni1-Quad example? How should I analyze the data? Let’s look at each of these:
Why are the homozygous deletions skewed so far to the left?
Simple math. Log ratio (LR) values are calculated according to the formula log2 (observed R/expected R) where R is the normalized intensity for each marker. If there is no copy of a DNA segment present, the measured intensity for the corresponding probe will only reflect the minor background fluorescence from neighboring probes, if anything at all. The ratio of this weak signal to the expected or reference signal is near zero, and the log ratio is then a large negative number, which leads to the observed skewness. Various platforms and normalization methods account for this phenomenon differently, and it is especially pronounced in Illumina’s standard processing.
Where are the copy gains?
In CNV analysis, we must be very careful about assuming that something is a gain or loss just because of the mean intensity value in a segment. With Affy data and aCGH data, we are accustomed to seeing a balanced distribution of segment means above and below the copy-neutral region, and we call those “gains” and “losses”, respectively. But what we are really detecting is gains and losses relative to the reference sample. A subject may appear to have a small copy gain, but further analysis will often show that they have 2 copies of a common CNV segment where the reference has less than 2 copies. Samples whose intensity match the reference intensity have an LR value near zero, while subjects with higher or lower copy numbers have LR values above and below zero, respectively. If the reference sample has a homozygous deletion, or if the most common state in a pooled reference is for a common deletion, then any samples without the deletion will have a positive LR value and appear to have a copy number gain.
I don’t know a lot about Illumina’s data normalization techniques, but from the segmentation output, we can assume that Illumina’s processing reflects the concept that most CNV segments are essentially indels that have 0, 1, or 2 copies, and that 3+ copies is very rare. This is related to the fact that Illumina uses a multi-ethnic panel of 200 HapMap subjects to determine the expected intensity at each marker. Like SNPs, the frequency of common CNVs is variable across populations. The result of using multiple ethnicities in the reference group is that the expected intensity remains close to two copies for CNVs that are common in only one of the populations, and the copy-variable subjects appear to be losses. In my experience, the only signals that stand out well above zero in Illumina data are large, rare CNVs or cytogenetic anomalies that clearly have 3+ copies, confirmed by examining the B-allele frequency data.
We have experimented with alternate processing and normalization techniques for Illumina data, including an adaptation of the quantile normalization process used in SVS for Affy data, and we are able to get a more balanced distribution of segment means. However, when we closely examine the results, we aren’t detecting more copy number gains—we are simply re-centering the distribution based on a different reference. A common CNV that appears to have 0, 1, or 2 copies using the standard Illumina processing will appear to have 1, 2, or 3 copies using the alternate processing—the values all increase by one, and this is based entirely on the reference. The actual copy number counts are much more difficult to determine. This is why the recent WTCCC CNV study and others don’t try to assign copy numbers, and instead, say only that a CNV region has 3 distinct classes or 4 or 5 classes. Calling something a loss or a gain is usually ambiguous, and entirely dependent on the reference.
Why aren’t there more heterozygous deletions?
I think that the answer to this question is illustrated in Figure 5. This image shows histograms of the segment means for a single, small, common CNV detected on the Omni1-Quad using both of the segmentation algorithms in SVS. This particular CNV is covered by only seven markers on the array. The upper histogram, based on the multivariate segmentation method, clearly shows 3 copy number classes. We generally expect the copy-neutral class to be centered at zero, but you can see that the two largest classes (presumably copy-1 and copy-2) are pushed to either side of zero. Why does this happen? My guess is that the mean intensity for this region fell somewhere between 1 and 2 copies in the multi-ethnic reference panel used by Illumina to establish the expected intensity. An unfortunate side effect of this phenomenon is that the univariate segmentation clearly identifies only 2 copy number classes for this CNV. The univariate algorithm was run with a strict permutation p-value of 0.001. At that level of precision, this 7-marker segment that diverges into classes just above and just below zero is not clearly differentiated from the flanking regions, and the result is that the univariate segmentation calls long segments that pass right through the region without any variation identified. (I should mention that other popular CNV detection methods also failed miserably at identifying this particular locus.) The “cnvi-” intensity probes on the Illumina arrays are generally targeted toward common CNVs, and we can see from this example that the center of the distribution can move away from zero for common CNVs, thereby confounding the segmentation results. There are hundreds of CNV loci like this one represented on the Omni1-Quad. That is at least part of the reason why heterozygous losses appear to be under-represented in the overall histogram of segment means for the Omni1-Quad. I should also point out that the problem of the intensity distributions shifting with common CNVs is not unique to Illumina–it happens on other platforms as well.
Figure 5: Distribution of signal intensity for a small, common CNV identified with the Illumina Omni1-Quad array. The baseline intensity for copy-neutral subjects moves away from the expected value of zero, likely as a result of copy number losses being common in the reference population. This shift in baseline intensity results in a failure of the univariate segmentation algorithm to distinguish between the neutral and hemizygous copy states. Multivariate segmentation correctly distinguishes between all three states.
And the most important question: How should we analyze the data?
Based on my experience analyzing multiple Illumina CNV datasets, I have come to the conclusion that a thorough analysis requires segmenting the data twice: once with the univariate method, and once with the multivariate method. The univariate method considers only one sample at a time, and is ideal for detecting rare and/or large CNVs. The multivariate method simultaneously considers all subjects, and is ideal for detecting small, common CNVs. Figure 6 illustrates the differences between the two methods, using an example with 200 subjects. An interesting workflow possibility that I have not yet tested would be to begin with multivariate to identify the small, common CNVs with highly variable distributions, then remove the markers in those regions from the dataset before running univariate segmentation to find rare CNVs. This might adress the common problem of very large CNVs being broken into numerous small segments in the segmentation output because of shifts in the reference intensity.
Figure 6: Heatmaps of segmentation results for 200 subjects assayed on the Affymetrix 500K platform. The upper panel shows univariate segmentation results, and the lower panel shows multivariate segmentation results. Red segments are probable deletions, while the blue segments are probable copy gains. (A) Rare copy gains detected only by univariate segmentation. (B) Rare deletions detected only by univariate segmentation. (C) Small, common CNVs detected poorly by the univariate method, but with better resolution by the multivariate method. (D) This common CNV was ignored as an outlier by univariate segmentation, which considers only one subject at a time, because it is covered by only one marker. Multivariate segmentation, which simultaneously considers all subjects, was able to recognize that this single marker clearly has multiple copy states. (E) Large, common deletion that was accurately detected by both segmentation methods. The multivariate method provides better resolution of the segment endpoints.
The online CNV tutorial demonstrates a simple script for classifying CNVs into discrete categories based on the mean segment intensity. This script can be used for Illumina data, but with limitations. It is probably safe to use it with univariate segmentation results to classify a two-state system of homozygous losses versus all other states. However, it is probably best not to try to apply a three-state model using a genome-wide classification threshold due to the fact that the threshold separating the clusters is different at every locus. For common CNVs, I suggest using the segmentation covariates (segmentation covariates is another term used for mean segment intensity within SVS) from the multivariate segmentation output as the primary data for analysis. Use regression or numeric association tests to compare the covariates with your phenotype, manually review the segment means for any significant results in order to determine the proper classification thresholds, and then report the copy-number states accordingly. For the best results, regardless of platform, it is necessary to do the classification individually for each CNV region. The large WTCCC CNV study published earlier this year used 16 different workflows for classifying common CNVs depending on the distributional properties of each. Perhaps they may have over-thought the problem, but there is definitely a lesson to be learned from their experience.
I have one final warning about CNV analysis based on LR data from GenomeStudio: Always check the LR distribution for differences between male and female subjects. Males tend to have slightly higher overall intensities, probably due to handling of the sex chromosomes in the normalization stage. If you are analyzing intensities or segment means, and especially if your phenotype has a gender imbalance, then this feature may confound your results. I suggest that all analyses be corrected for gender.
…And that’s my 2 SNPs.