Using CNAM Optimal Segmenting

10.3 Using CNAM Optimal Segmenting

CNAM Optimal Segmenting is the second step in copy number analysis, performed after importing log2 ratio data into a project and applying an appropriate genetic marker map (see Genetic Marker Maps Overview for more information). CNAM optimal segmenting uses both the genetic marker map information and the log2 ratios in the spreadsheet to find segments of consecutive markers whose log2 ratios differ significantly from those of neighboring segments. While the genome has numerous regions of copy number variation, these regions are approximated by the segments found with the CNAM optimal segmenting algorithm. With high probability, these segments correspond to regions of copy number loss, neutral copy number, or gain in the data.

Upon segmenting, at least two new spreadsheets are created in the current SVS project: the segment means spreadsheet and the covariates spreadsheet. The segment means spreadsheet lists every region computed, its beginning and ending markers, and the segment mean log2 ratio value for every sample within that region. A covariate segment is created for all start and end positions for all samples. Each sample will have exactly the same number of covariates. The value of a sample’s covariate is the segment mean of that sample’s segment containing the covariate’s start and end positions. The covariates spreadsheet can be output in one of two formats: either a column is created for every active marker in the spreadsheet that was segmented, or a column is created for the first marker in every covariate segment. Optionally, a Wiggle file containing the locations of these regions may also be generated.

Options and other fields within the CNAM Optimal Segmenting tool are described below (see Figure 79).


Figure 79: CNAM Optimal Segmenting Window

Log2 Ratio Spreadsheet

To use CNAM optimal segmenting, the spreadsheet must contain log2 ratios and have a genetic marker map applied. From this spreadsheet, select Analysis > CNAM Optimal Segmenting.

Selecting Chromosomes

For large datasets, it is better to segment only one chromosome, or a few chromosomes, at a time. Because CNAM optimal segmenting does not segment across chromosomal boundaries, subdividing the segmenting by chromosome will not change the results.

To select one or a few chromosomes, use the Select > Activate by Chromosomes option, then select the chromosomes you wish to segment. It is not necessary to create a subset spreadsheet, as the segmenting algorithm will only run on active numeric columns.

Segmenting Options

Variations of the CNAM optimal segmenting algorithm for obtaining the regions of CNV are documented in the Formulas and Theories chapter (see CNAM Optimal Segmentation Algorithm). Certain parameters for this algorithm may be changed within these segmenting options.

Algorithm

CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same algorithm, but use different criteria for determining cut-points denoting CNV boundaries.

The multivariate method segments all samples simultaneously, finding general CNV regions which may be similar across all samples. This method is preferable for finding very small CNV regions. For a given sample, the covariate is the mean of the log2 ratios within each segment for that sample. These covariates can then be used for association analysis. This model makes the tenuous assumption that for a given disease, the beginning and end of a CNV region will be similar for subsets of the cases. That is, if the regions are conserved for enough cases it is expected there is sufficient power to find a statistical association. If this assumption holds true, very small CNV regions can be found because the signal will be assessed over multiple samples.

In reality, there may not always be consistent CNV regions across multiple samples. The univariate method segments each sample separately, finding the cut-points of each segment for each sample individually. A spreadsheet is then created showing all unique cut-points found among all samples. In other words, the univariate method discovers the optimal segments for each sample and outputs the mean, for every sample, of every unique segment found across all samples. This output can be displayed in one of two formats, ready for subsequent association analysis or for plotting results. The output spreadsheets are discussed in Outputs from CNAM Optimal Segmenting.
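The way per-sample cut-points combine into a shared set of covariate segments can be sketched in Python. This is only an illustration of the description above, not the SVS implementation: the function name, the dictionary layout, and the toy cut-points are assumptions made for the example, and cut-points are given as marker indices at which a new segment starts.

```python
from bisect import bisect_right
from statistics import mean

def covariate_table(log2, starts_by_sample):
    """Hypothetical helper, not part of SVS.

    log2:             {sample: [log2 ratio per marker]}
    starts_by_sample: {sample: sorted marker indices where a segment starts,
                       always beginning with 0}
    Returns the shared covariate segment starts and one row of values per sample.
    """
    n_markers = len(next(iter(log2.values())))
    # The union of every sample's segment starts defines the covariate segments,
    # so every sample ends up with the same number of covariates.
    starts = sorted(set().union(*starts_by_sample.values()))
    table = {}
    for sample, values in log2.items():
        own = starts_by_sample[sample]
        row = []
        for start in starts:
            # Find the sample's own segment that contains this covariate segment.
            i = bisect_right(own, start) - 1
            seg_start = own[i]
            seg_end = own[i + 1] if i + 1 < len(own) else n_markers
            # The covariate value is the mean of that whole segment.
            row.append(mean(values[seg_start:seg_end]))
        table[sample] = row
    return starts, table

# Toy data: S1 has a gain over markers 2-3, S2 changes level at marker 3.
log2 = {
    "S1": [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "S2": [0.5, 0.5, 0.5, -0.5, -0.5, -0.5],
}
starts_by_sample = {"S1": [0, 2, 4], "S2": [0, 3]}
starts, table = covariate_table(log2, starts_by_sample)
print(starts)  # covariate segments start at markers 0, 2, 3 and 4
print(table)
```

In this toy case the cut at marker 3 comes only from S2, so S1 simply repeats its segment mean on both sides of it.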

Speed Optimization

If “None” is selected, the entire chromosome is segmented at once. If “Moving Window” is selected, the segmenting is performed using a moving window of markers. This latter approach can dramatically reduce the run time of the algorithm, because the run time for a given window is proportional to the square of the window size multiplied by the maximum number of segments per window. For example, segmenting a chromosome with 50,000 markers at once takes on the order of 50,000² times the “Max segments per window” value operations to find the optimal cut-point(s).
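The arithmetic behind this can be sketched as a rough operation count. The helper below is hypothetical and makes simplifying assumptions that are not taken from SVS: windows are treated as non-overlapping, and the count ignores constant factors and the cost of permutation testing. The values 5,000 and 20 are the window size and maximum segments discussed under the parameters below.

```python
def segmenting_cost(n_markers, max_segments, window_size=None):
    """Rough operation count: (window size)^2 * max segments, summed over windows.

    Hypothetical estimate only; assumes non-overlapping windows.
    """
    if window_size is None or window_size >= n_markers:
        # No speed optimization: the whole chromosome is one window.
        return n_markers ** 2 * max_segments
    full_windows, remainder = divmod(n_markers, window_size)
    return (full_windows * window_size ** 2 + remainder ** 2) * max_segments

whole = segmenting_cost(50_000, 20)
windowed = segmenting_cost(50_000, 20, window_size=5_000)
print(whole, windowed, whole // windowed)
```

Under these assumptions, ten windows of 5,000 markers cost about one tenth of segmenting all 50,000 markers at once, and the saving grows with chromosome size.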

Univariate outlier removal

The univariate outlier removal option helps to address the influence of large negative or large positive values on determining segment boundaries. It works by excluding found cut-points that bracket single-marker segments before running permutation tests to determine the strength of the segment. This option is only valid when the minimum number of markers per segment is set to 1. If outliers are not removed and the minimum number of markers per segment is set to a number greater than one, a single-marker outlier could force adjacent markers to create a segment that is driven only by the single outlier. This would inflate the number of segments that have the minimum number of markers allowed, and would specify boundaries incorrectly if the number of markers in the region was actually less than the minimum number of markers allowed in a segment. If the minimum number of markers in a segment is set to one and the univariate outlier removal box is not checked, single-marker segments would be found, but they would not be deemed significant by permutation testing. As a result, the algorithm would look for fewer segments, at the expense of the larger, real segments. See CNAM Optimal Segmentation Algorithm for more details on this option.
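One literal reading of "excluding found cut-points that bracket single-marker segments" can be sketched as follows. This is an assumption about the behaviour rather than the documented SVS algorithm (see CNAM Optimal Segmentation Algorithm for the actual details); the function name and the convention that a cut-point is the marker index where a new segment starts are both hypothetical.

```python
def drop_outlier_cut_points(cut_points, n_markers):
    """Remove cut-points that bracket a single-marker segment.

    Hypothetical sketch. cut_points: sorted interior marker indices at which
    a new segment starts (0 < c < n_markers).
    """
    bounds = [0] + list(cut_points) + [n_markers]
    bracketing = set()
    for left, right in zip(bounds, bounds[1:]):
        if right - left == 1:  # segment [left, right) holds exactly one marker
            bracketing.update((left, right))
    # Keep only cut-points that do not border a single-marker segment.
    return [c for c in cut_points if c not in bracketing]

# Ten markers with cuts at 3, 4 and 7: marker 3 sits alone in its own segment.
remaining = drop_outlier_cut_points([3, 4, 7], n_markers=10)
print(remaining)  # only the cut at marker 7 survives
```

Only the cut-points that survive this step would then go on to permutation testing.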

Moving window size (#markers)

The number of consecutive markers analyzed in the moving window. This option is only available if “Moving Window” is selected.

Using a moving window is highly recommended because it minimizes the multiple testing problems associated with segmenting a whole chromosome at a time. Because the number of markers varies from chromosome to chromosome, not specifying a window size for permutation testing means each chromosome is treated differently, which makes the multiple testing correction more difficult to determine and less transparent.

Experience has shown 5,000 to be a conservative default, giving good results with acceptable computation times. Note, however, that a moving window approach carries a somewhat higher risk of false discoveries, as anomalies can arise from looking at a window of data instead of all of it. Permutation testing minimizes this risk.

Max segments per window

Set this to be greater than or equal to the number of CNV regions expected in a given window. The number of significant segments will be constrained to be no more than this number in a given window.

Our own empirical evidence has shown that 20 is a reasonable setting for a window size of 5,000, given marker densities of 2,000-5,000kb such as found with Affymetrix and Illumina platforms. If you use a larger window size, you should increase this value accordingly. If the Run Log (see Run Log) frequently shows the number of segments found equaling this parameter, we recommend increasing this value.

Min #markers per segment

This constrains the algorithm to only find CNV regions with this minimum number of markers in each segment.

This parameter allows you to prevent finding CNV regions based on short spans of noise. In general, permutation testing should prevent small spurious segments from showing up, so a good default for this parameter is 1 marker, with univariate outlier removal turned on for univariate analysis. For multivariate analysis, a minimum of 1 marker is still a good default. It is important to take into account any outliers in the log2 ratios for a sample. Outliers can still drive the segmentation results even after permutation testing, although their effect is minimized. To remove their effect, use the univariate outlier removal option.

Max pairwise segment p-value

The “Max segments per window” parameter sets an upper bound on the number of segments found in a window. However, the problem remains to determine the actual number of valid CNV regions in the data. The process is as follows: once a set of k segments is found, each pair of adjacent segments is compared through a permutation testing procedure. If every pair is statistically significant according to the “Max pairwise segment p-value”, then the k-way split is retained. Otherwise, the algorithm decreases k by one and repeats until every segment is significantly different from its neighbor or no segments are found, whichever comes first.
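The loop described above can be sketched in Python for a single sample and a single window. This is a simplified stand-in, not the CNAM algorithm: `best_split` uses an ordinary least-squares dynamic program in place of the CNAM optimal segmenting criteria, the permutation test compares segment means by shuffling markers between the two segments, and all function names are hypothetical. The permutation count follows the "10 divided by the p-value" rule given in this section.

```python
import random
from statistics import mean

def best_split(values, k):
    """Segment starts of the k-way split minimising within-segment squared error."""
    n = len(values)
    pre, pre_sq = [0.0], [0.0]
    for v in values:
        pre.append(pre[-1] + v)
        pre_sq.append(pre_sq[-1] + v * v)

    def sse(i, j):  # squared error of values[i:j] around its own mean
        s = pre[j] - pre[i]
        return (pre_sq[j] - pre_sq[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for segs in range(1, k + 1):
        for end in range(segs, n + 1):
            for start in range(segs - 1, end):
                c = cost[segs - 1][start] + sse(start, end)
                if c < cost[segs][end]:
                    cost[segs][end], back[segs][end] = c, start
    starts, end = [], n
    for segs in range(k, 0, -1):  # walk back through the chosen segment starts
        end = back[segs][end]
        starts.append(end)
    return starts[::-1]

def perm_p_value(a, b, n_perm, rng):
    """Permutation p-value for the difference in means of two adjacent segments."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def significant_segments(values, max_segments, max_p, rng):
    """Decrease k until every adjacent pair of segments is significant."""
    n_perm = round(10 / max_p)  # e.g. 1,000 permutations for p = 0.01
    for k in range(max_segments, 1, -1):
        starts = best_split(values, k)
        bounds = starts + [len(values)]
        segs = [values[i:j] for i, j in zip(bounds, bounds[1:])]
        if all(perm_p_value(a, b, n_perm, rng) <= max_p
               for a, b in zip(segs, segs[1:])):
            return starts
    return [0]  # no significant cut-points: one segment

# Toy window: eight markers near 0, eight near 1, eight near 0.
values = [0.02, -0.02] * 4 + [1.02, 0.98] * 4 + [0.02, -0.02] * 4
found = significant_segments(values, max_segments=4, max_p=0.01,
                             rng=random.Random(0))
print(found)
```

With this toy data the 4-way split has to cut inside one of the flat stretches, which the permutation test rejects, so the sketch falls back to the 3-way split with segments starting at markers 0, 8 and 16.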

Smaller p-values decrease the false-discovery rate but also decrease sensitivity. Conversely, larger p-values increase sensitivity, but increase the false-discovery rate.

The number of permutations performed is 10 divided by this parameter, so 1,000 permutations will be done for p = 0.01, and 200 for p = 0.05. The default of 0.01 is a good starting point, corresponding to a 1 percent chance of a false positive. Testing on simulated data with known answers has shown that 0.01 generally gives the best combination of sensitivity and false-discovery rate.

Number of Threads

Both the Univariate and Multivariate algorithms can take advantage of multi-processor or multi-core machines by performing some of their work in parallel threads. It is usually a good idea to match this number to the number of computational cores you have available on your system.

Optional Output Files

On the Optional Output tab, checking the Optional Bookmark File Output box exports the segment means to a file in UCSC Wiggle Track (WIG) format for import into the Genome Browser. Use the Browse button to select the file name.

If WIG files are output while using the Univariate segmenting algorithm, the Browse button will instead ask you to select a directory, because a separate WIG file is generated for each sample. These files will be named using the sample name from the file.
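For orientation, a WIG track carrying segment means could look like the output of the sketch below. It uses the standard UCSC `variableStep` declaration with a `span`, one declaration per segment; the exact layout SVS writes may differ, and the function name, track name, and coordinates are made up for the example.

```python
def segment_means_to_wig(track_name, segments):
    """Build WIG text from segments (hypothetical helper, not the SVS exporter).

    segments: iterable of (chrom, start_bp, end_bp, segment_mean), with
    1-based inclusive base-pair coordinates.
    """
    lines = [f'track type=wiggle_0 name="{track_name}"']
    for chrom, start, end, value in segments:
        span = end - start + 1
        # One declaration per segment, since each segment has its own length.
        lines.append(f"variableStep chrom={chrom} span={span}")
        lines.append(f"{start} {value}")
    return "\n".join(lines) + "\n"

wig_text = segment_means_to_wig(
    "Sample1 segment means",
    [("chr1", 1001, 5000, 0.25), ("chr1", 5001, 9000, -0.5)],
)
print(wig_text)
```

For univariate output, the same function could be called once per sample, writing each result to a file named after that sample in the chosen directory.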

Excluding Markers

If desired, markers can be excluded from the segmenting algorithm and its results by inactivating the columns corresponding to those markers.

Run Log

A log is shown in this window informing you of the progress in segmenting sub-regions of the total region of markers being analyzed. If the number of segments found in a given window is equal to the maximum number of segments per window, a warning message will be printed in red, suggesting the user consider increasing that parameter.

NOTE:

  • During processing, a normal progress bar is also shown in a separate window.