Golden Helix Copy Number Segmentation Algorithm
25.10.1 Overview
The copy number segmenting algorithm is designed to find local variations over markers. The segmenting process is optimized by working at three levels:
- If desired, the region of markers is subdivided into a moving window of sub-regions.
- A unique segmenting algorithm is applied to find multiple segments wherever possible, and by implication, segment boundaries which we term “cut-points”.
- A permutation algorithm is applied to validate the found cut-points.
CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same algorithm, but use different criteria for determining cut-points. The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions that may be useful for association studies. For a given sample, the covariate is the mean of the intensities within each segment for that sample. If there are consistent positions for copy number variation across multiple samples, the copy number segments will be found. In reality there may not always be consistent copy number segments across multiple samples. The univariate method segments each sample separately, finding the cut-points of each segment for each sample and outputting a spreadsheet showing all cut-points found among all samples.
|
25.10.2 Obtaining Segments with a Moving Window
CNAM allows segmenting with a moving window, where the segmentation algorithm is applied to each sub-region created by the window. Using the optional moving window will result in an overall faster analysis; however, the results may not be as robust as analyzing the entire chromosome simultaneously. We implement an approach to windowing that minimizes edge effects as the moving window moves along the chromosome:
- Start out with a sub-region of s markers.
- Perform the segmenting algorithm on the current sub-region.
- If two or more cut-points were found, meaning that there were more than two segments found, use all the
cut-points except the last one and begin the next sub-region at the second-to-last cut-point. Use s markers.
Otherwise:
- If no cut-points could be found at all to subdivide the one sub-region, the next sub-region is started in the
middle of the current sub-region and extended for s markers. No cut-points are yielded in this case.
NOTE: This may mean that a region of copy number variation which started before this sub-region will be considered to be extended through this sub-region and possibly for much longer. We do not arbitrarily subdivide regions based on moving-window boundaries–instead, this whole procedure is meant to find valid cut-points throughout the main region.
- If the sub-region was successfully segmented into exactly two segments, yielding exactly one cut-point, and the cut-point is in the first half of the sub-region, this cut-point is accepted, and the next sub-region is started at that cut-point and extended for s markers.
- If the sub-region was successfully segmented into exactly two segments, yielding exactly one cut-point, but the cut-point is in the second half of the sub-region, the apparent cut-point may be due to an edge effect. Thus, the cut-point is not used and the moving-window sub-region is changed so as to get better segmenting. If the size of the sub-region is now still s, the sub-region is simply extended so that the total size of the sub-region is 2s. If the size of the sub-region is already 2s, the next sub-region, which goes back to being length s, is started at the former start plus s∕2.
- If no cut-points could be found at all to subdivide the one sub-region, the next sub-region is started in the
middle of the current sub-region and extended for s markers. No cut-points are yielded in this case.
- Finally, when the end of the main region is encountered, whatever cut-points are obtained from the sub-region are accepted.
You may set the window size as one of the optional segmenting parameters (25.3.3).
25.10.3 Obtaining Segments without a Moving Window
Selecting not to use a moving window in the segmenting analysis results in the sub-region spanning an entire chromosome. The segmentation algorithm is then applied to the whole chromosome. This is more computationally intensive, and the most trusted procedure. Note, that it may be necessary to increase the Max segments per window to more than the default of 20, as a given chromosome may well have more than 20 regions of copy number variation.
25.10.4 Copy Number Segmentation within a Sub-Region
The segmentation algorithm exhaustively looks through all possible cutpoints for a given region of data, to minimize the sum of squared deviations of the data from the mean of its respective segment. Given n points or intensities, and k cutpoints defining k + 1 segments, the number of possible cutpoints is O(nk). That is, the search space explodes exponentially. This exponential search space perhaps accounts for the explosion of algorithms that use something approaches or approximate segmenting approaches to find the segment regions.
Ironically, since 1999, within the tree splitting code of HelixTree, we have had an optimal algorithm that finds segment boundaries that overcomes the exponential search problem. This algorithm, due to Hawkins, see [Hawkins 1972, Hawkins 1973, Hawkins 2002], is based on dynamic programming. It exhaustively searches through all possible cutpoint positions to find an optimal segmenting of the data. We employ this approach with a few modifications to the copy number problem.
The segmentation algorithms in HelixTree use “segmentation for a normally distributed response for a continuous-ordinal predictor”. This mode of segmentation is partly analogous to a univariate counterpart used in creating tree models and documented in 26.3. However, there are important differences, the main one being that no final p-value calculation for the entire sub-region is attempted because the main emphasis here is on local variation rather than an entire “split”. Also, the tree models do currently not implement a permutation testing procedure for determining split cardinality.
NOTE: The following points apply for the copy number segmentation algorithms:
- Within the sub-region under analysis, we consider the number of markers to correspond to the number of observations.
- These markers are considered to be implicitly numbered consecutively. The implicit marker number is used as the (continuous-ordinal) “predictor” variable.
- The object of this segmentation is to find which markers border areas of significant copy number variation–that is, to find for which regions of markers there is at least one sample for whom the log2 ratio varies significantly between the regions, presumably indicating copy number variation between those regions.
- As mentioned above, we label the borders between these regions as “cut points”.
In the multivariate algorithm, the samples (from the respective subjects) are considered to be the respective multivariate “dimensions” of this analysis. In the univariate algorithm, the dimension of the analysis is one as each sample is analyzed independently. The log2 ratio for every sample and every marker is considered to be the “response” variable.
The model is that the observations segment into k groups, and that each group, within each dimension, has a different mean with noise. The k - 1 cut-points that optimally split the data in a maximum likelihood sense are found by minimizing the sum over all dimensions of sums of squared deviations of the group means from the observations.
The positions of these cut points are found in advance for every possible total number of segments, from two to the maximum number of segments allowed. You may set the maximum number of segments (per sub-region) kmax as one of the optional segmenting parameters (25.3.3).
To determine if we can use a given number of segments k, a permutation testing based comparison is done between each pair of adjacent segments. The permutation testing procedure is described in the next section. If the segments are significantly different for every pair of adjacent segments, k segments are used. Otherwise, this same determination is tried for k - 1 segments, and so on. This process is started at k = kmax and continued through k = 1.
It may turn out that this process continues through to k = 1. This means it is not possible to divide the observations into more than one segment nor to get any cut-points (over this sub-region).
25.10.5 Permutation Testing for Verifying Copy Number Segments
An optimal k-way split is one that defines k segments such that the sum of squared deviations from the mean of each segment is minimized. In other words, there is no better placement of cutpoints that would result in a smaller sum of squared deviations of the data from the respective mean of the segment in which it falls. The challenge then, given this optimal segmenting is determining whether the resulting segments represent significant differences between pairwise adjacent segments.
Consider two adjacent segments. The left segment has a mean, and the right segment has a mean. The cutpoint between them minimizes the sum of squared deviations of the data from the two respective means. If we were to just perform a T-test comparing the adjacent segments, we would need to adjust the resulting p-value some kind of multiple testing correction for all of the possible cutpoints that were searched through to find the best one. A Bonferroni adjustment would be too conservative due to correlation structure in the data. Instead we turn to permutation testing.
The key to performing permutation testing correctly is to randomize and perform the same procedure as was used to find the original statistic. This ends up being fairly straightforward.
Given Maxp is the Max pairwise permuted p-value parameter, for each adjacent pair of segments we perform the following procedure:
- Calculate the original sum of squared deviations from the means of the two adjacent segments.
- Set count = 1.
- Do the following 10∕MaxP - 1 times:
- Randomly shuffle the marker label columns (i.e. multivariate columns are kept together).
- Find the optimal two-way split that minimizes the sum of squared deviations from the means within the random data.
- If the randomly computed sum of squared deviations is less than or equal to the original sum of squared deviations, set count = count + 1
- The pairwise p = count∕(10∕Maxp)
If the permuted pairwise p is less than or equal to Maxp for all adjacent pairwise segments, all of the segments are statistically significant. Otherwise, we repeat the procedure with the optimal k - 1 way segmenting, until either we find all pairwise segments are significant, or all the data in a region represents a single segment (or possibly part of a larger segment if the moving window procedure is used).