Science and Methodology Behind Whole Genome Copy Number Association Analysis
The following outlines the science and methodology behind the Copy Number Analysis Module.
Problems with Existing Copy Number Analysis Methods
To date, most copy number analysis methods have been forced to rely either on Hidden Markov Models (HMMs) or "segmentation" methods that have proven to be sub-optimal due to high false discovery rates, low sensitivity, or slow computational speed.
Hidden Markov Models
In recent years, two survey papers by (Willenbrock, 2005), (Lai,2005) have been published comparing various methods for finding copy number segments. The authors found that HMMs performed poorly (Figure 1), with high false discovery rates (~0.40-0.60) and low sensitivity (~50-80%), even when applied to simulated data with known properties. Additionally, intensity data, which is used for detecting and identifying copy number segments, is often “noisy,” causing concern among researchers. If the data that feeds the challenged HMM algorithms is itself suspect, then the probability of the final result being reliable is even more questionable.
The benefit of HMMs, however, is that they are relatively fast, the primary reason they are the most widely implemented methods today.
Figure 1. False Discovery Rate vs. Sensitivity
The Promise and Challenge of Segmentation Methods
In the Willenbrock study mentioned above, the authors concluded that the most effective method for finding copy number segments is Circular Binary Segmentation (CBS), a segmentation approach based on finding change-points in data. Such and approach has been implemented in DNAcopy, for example. Applying CBS to the same simulated data set, the authors were able to achieve a 0.06 median false discovery rate with 0.88 sensitivity. However, though effective for finding segments and despite speed optimizations by Venkatraman2, CBS is not computationally efficient for whole genome analysis. Case in point: analysis on Affymetrix 500K data has been shown to take over 20 minutes per sample and roughly 45 minutes per sample on Illumina 550K data (Venkatraman, 2007).
The challenge with segmentation methods is that in order to find the optimal segmentation, all possible change-points need to be evaluated, creating a combinatorial explosion. For example, if there are 10 copy number segments positioned across 2,000 probe intensities, then there are 200010 places these segments may lie (roughly a 1 followed by 33 zeros, or one decillion). Given a chromosome may have as many as a hundred change-points and today’s whole genome arrays contain over 50,000 intensity values for a given chromosome, the search space can easily exceed 50,000100 possible change-points.
This seemingly impossible series of calculations is what has led researchers to adopt various heuristics to make this process computationally viable, as was done with CBS, or by avoiding segmentation altogether.
Challenge of Making Integer Copy Number Calls
When developing and testing our methods we found that single nucleotide polymorphism (SNP) probe intensity data, which is used for detecting and identifying copy number segments, is often “noisy,” making it difficult to make unambiguous integer copy number calls.
The two histogram plots below represent the distributions of males versus females generated from normalized probe intensities across the X chromosome using an Illumina 330K data set with 1,000 samples.


Because males have one copy of the X chromosome and females 2, we expected to see two distinct distributions with a tight distribution around the mean. To test this, we first normalized the intensities using publicly available Illumina HapMap reference data (top) and got a large overlap between the two distributions as well as rather wide distributions, particularly among the females. We then normalized using the sample as its own reference (bottom) - just the controls - and got tighter distributions and less overlap.
Nevertheless, the fact that there is this substantial overlap makes it problematic to make unambiguous integer copy number calls, especially for small copy number variations. If the data that feeds a copy number calling algorithm (such as Hidden Markov Models) is suspect, then the probability of the final result being reliable is even more questionable. For this reason, we focused our efforts almost exclusively on using the copy number intensities themselves for analysis instead of trying to convert them to integer calls.
Golden Helix Dynamic Optimal Segmenting Algorithm
As discussed above, Hidden Markov Models (HMMs) have proven to be problematic for uncovering true copy number variations and though existing segmentation methods, such as Circular Binary Segmentation (CBS), have shown great promise, they are too inefficient for whole genome analysis.
Golden Helix Dynamic Optimal Segmentation Algorithm
In our initial research, we found that the same powerful technology used for segmentation in our predictive analytics tools, HelixTree and Optimus RP, could be applied to effectively overcome the problems of existing copy number analysis methods.
In fact, for nearly a decade, we have been using an optimal segmenting algorithm that exhaustively searches through all possible change-points in data to find the optimal segmentation without succumbing to the inherent combinatorial explosion. The algorithm, based on published work by Dr. Douglas Hawkins (Hawkins, 1972, 1973, 1976, 2002) of the University of Minnesota’s School of Statistics, utilizes dynamic programming to meticulously look through all possible change-points to uncover those optimizing the sum of squared deviations from the mean within each segment. A creative application of this segmenting approach, as implemented in the Copy Number Analysis Module (CNAM), has enabled us to effectively solve the copy number computation problem.
For more details see the section, The Copy Number Segmentation Algorithm in the latest HelixTree Manual.
Univariate versus Multivariate Segmenting
CNAM offers two types of segmenting methods, univariate and multivariate. Both methods are based on the same algorithm, but use different criteria for determining segment change-points.
Figure 1.
Univariate Segmenting
The univariate method (Figure 1 - right) segments each sample independently. Each time there is a change-point for a given sample, an additional covariate is created. The value of the covariate is the mean intensity within the original segment for that sample. The result is a spreadsheet showing all change-points found among all samples. The univariate approach is good for finding individual variations but may miss shorter copy number variations due to noise in the data. It is for this reason that a multivariate approach is also provided.
Multivariate Segmenting
The multivariate method (Figure 1 - left) segments all samples simultaneously, finding partial consensus copy number regions based on variation across many samples. In this case, every sample shares the same segment boundaries and the resulting covariates are the mean intensities for the given sample within the segment.
If there are consistent positions for copy number variation across multiple samples, the copy number segments will best be found using the multivariate method. It does make the fundamental assumption, however, that the boundaries for each sample are somewhat near one another. In reality there may not always be consistent copy number segments across multiple samples and the multivariate method may miss some segments where boundaries are irregular. Thus the multivariate method is preferable for finding very small copy number regions and for finding partially conserved regions that may be useful for association studies.
Permutation Testing for Verifying Copy Number Segments
Solving the copy number computational problem was not the final step in the process of optimizing a copy number analysis method. The best results remained somewhat illusive due to the noise inherent in intensity data. There was also an added challenge in segmenting approaches: determining the correct number of copy number segments supported by the data.
How then do we test if adjacent segments are statistically significant? A paired T-test comes to mind, and indeed is a reasonable approach (Musser, 1999). However, this approach does not account for the multiple testing involved in picking the optimal change-point out of all of the possible ones. A Bonferroni adjustment can be used, but ends up being overly conservative because it does not account for correlations between adjacent points.
To overcome these last hurdles, we supplemented our optimal segmentation algorithm with an efficient permutation testing approach, designed to cut through the noise and uncover true segments of copy number variation by determining and validating their statistical significance. The algorithm is as follows:
- 1. Choose a type I error threshold p (for instance 0.01).
- 2. For each pair of adjacent segments in the optimal segmenting, calculate the intra-segment sum of squared deviations from the mean statistic.
- 3. Calculate a distribution of intra-segment sum of squared deviations from the mean statistics on randomly permuted data from the two segments as follows:
- a. Set count = 1
- b. For each of 10/p – 1 iterations
- i. Randomly shuffle the data (note for multivariate response, that columns of probe intensities are kept together – think of it as shuffling the labels)
- ii. Calculate the optimal two-way segmenting on the randomly shuffled data
- iii. If the sum of squared deviations from the mean statistic is less than or equal to the unpermuted statistic, set count=count+1
- iv. If count/(10/p) > p, stop – the segment pairs are not significant
- 4. If any pair of segments is not significant using the permutation procedure in step 3, the k change-point segmenting is not significant. Revert to the best k-1 way segmenting and repeat from step 2. If all pairwise segments are significantly different using the procedure in step 3, we stop and output the k change-points.
In short, each pair of adjacent segments must be statistically different. After the optimal segmentation process is completed, the data for adjacent segments is randomly shuffled. The shuffled data is optimally binary segmented, and the resulting sum of squared deviations from the mean is compared against the original segment pair. This process is repeated numerous times, according to a specified p-value threshold, until the original segment pair is determined to be significant versus the permuted ones.
If not, the segment is discarded as insignificant. If all adjacent segments are statistically different, the data supports the segmentation. Otherwise, the next lower cardinality of optimal segmentation is evaluated.
Generating Copy Number Covariates for Association Analysis
Once we have a solid method for finding segment boundaries, generating covariates for whole genome analysis is relatively straightforward. For the univariate case we look across all samples and each time a new segment is introduced for any sample, we introduce a new covariate that starts at that segment and ends at the beginning of the next segment found across all samples. The value of the covariate is the mean not within this pseudo segment, but rather within the larger segments that each sample has been divided into. For the multivariate case, all samples share the same change-points, and the mean log R intensities for a given sample within a given segment form covariates for association. We then can then join phenotypic information with these continuous covariate and perform association tests, using either logistic regression or segmentation. Figure 1 below is a screenshot showing two covariate segments joined with additional phenotype information.
Depending on how stringently parameters are set prior to running the segmentation algorithm, both the univariate and multivariate segmenting methods will create a fair of number of covariates to be used for association analysis. This number, however, will be orders of magnitude less than the total number of original intensity values, thereby reducing the multiple testing penalty.
Figure 1. Covariate Segment Spreadsheet
Validation and Performance
Validation
Having successfully implemented the permutation testing approach for copy number analysis and developed the CNAM tool, a series of tests were conducted on the same simulated data used in the Willenbrock study (Willenbrock, 2005), but with better results than both Hidden Markov Models and DNAcopy’s Circular Binary Segmentation (CBS) method (Figure 1). CNAM was able to achieve a median false-discovery rate of 0.0 with a sensitivity of 0.90 versus DNAcopy (0.06, 0.88) and Hidden Markov Models (~0.40, ~0.80).
Figure 1. False Discovery Rate vs. Sensitivity
During the development and testing of our methods, we were also able to look at several public and private whole genome data sets from both Affymetrix (500K) and Illumina (330K, 370K, 550K and 1M) genotyping platforms. We segmented Affymetrix and Illumina Hapmap CEPH samples using the univariate segmenting procedure. We found that the 370K and 1M platforms found similar regions of variation, despite the higher density of the 1M array. With the Affymetrix 500K CEPH samples, we were able to find many of the regions previously published by (Redon, 2006).
We also had the opportunity to look at replicated experiments, and analyze these with the multivariate approach. The higher signal to noise ratio resulted in lower false discovery rates – another benefit of multivariate procedures.
When we ran the multivariate procedure on Affymetrix data, we found a few regions that were highly variable across many samples. In both Affymetrix and Illumina platforms, single SNPs of large intensity magnitude were often found. Some of these are clearly poorly performing probes. Some may be single-snp CNVs, but we were not able to investigate this.
Performance
To complement its accuracy, the computation time of the algorithm is also dramatically faster than DNAcopy's CBS, taking roughly five minutes per sample to analyze Affymetrix 500K data and six minutes for Illumina 550K versus DNAcopy's ~20 minutes and ~40 minutes respectively (Venkatraman, 2007). To speed the process even further, the algorithm can be run using multi-core systems.
Whole Genome Copy Number Association Workflow
Download the following PDF for a detailed step-by-step tutorial (22 pgs.) on a recommended whole genome copy number association workflow using CNAM and HelixTree:
Whole Genome Copy Number Association Workflow
The tutorial is broken down into seven steps, each consisting of one or more processes. They are:
- 1. Generate log ratios versus reference samples
- 2. Identify markers to exclude
- 3. Correct for batch effects and stratification
- 4. Perform whole genome log ratio association tests
- 5. Run segmenting algorithm to generate segmenting mean values and covariate table
- 6. Perform association analysis on copy number segment covariates
- 7. Visualize segmenting results
Hawkins DM (1972) On the choice of segments in piecewise approximation. Jour. Inst. Math. Applications, 9:250–256.
Hawkins DM and Merriam DF (1973) Optimal zonation of digitized sequential data. Jour. Math Geology, 5:389–395.
Hawkins, DM (1976). Point estimation of the parameters of piecewise regression
models. Applied Statistics 25:51–57.
Hawkins, DM, (2002). Fitting multiple change-points to data. Computational Statistics and Data Analysis, 37:323–341
Lai, W., Johnson, M., Kucherlapati, R., and Park, P.: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005. v. 21(19):3763-70.
Musser BJ (1999) "Extensions to Recursive Partitioning" Ph.D. Thesis, University of Minnesota School of Statistics.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, Macdonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME. Global variation in copy number in the human genome. (2006) Nature. 444:444–454.
Venkatraman ES and Olshen AB. (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 23:657–663.
Willenbrock, H. and Fridlyand, A.: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005 v. 21(22):4084-91