Using the Copy Number Analysis Segmentation Tool
The Copy Number Analysis Segmentation tool is the second step of copy number analysis within CNAM, used once you have a “.dsf” file of log2 ratio data. It scans this file and uses a segmenting algorithm (25.10) to find regions of markers whose log2 ratios differ significantly from those of neighboring regions. These regions are, in general, where copy number variation occurs in your data.
This tool then creates a dataset (spreadsheet) in your current HelixTree project that lists every region computed, its beginning and ending marker, and the sample mean for every sample within that region. Optionally, a Wiggle file may also be generated which contains the locations of these regions.
A description of the options and other fields within the Copy Number Analysis Segmentation tool follows.
25.3.1 The DSF File
Select the “.dsf” file of log2 ratio data. Use the Browse button for easier file selection.
25.3.2 Chromosome Selector
When you select the “.dsf” file of log2 ratio data, a list of chromosomes within that file will appear.
If you wish to deselect some of these, press the “Chromosomes” button to get a chromosome selector. This will have a check box for each chromosome, so that you can deselect any chromosomes that you wish.
25.3.3 Segmenting Options
The segmenting algorithm for obtaining the regions of markers where there are probable copy number variations is documented below (25.10). Certain parameters for this algorithm may be changed within these segmenting options.
25.3.3.1 Algorithm
CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same algorithm, but use different criteria for determining cut-points.
The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions. For a given sample, the covariate is the mean of the log2 ratios within each segment for that sample. These covariates can then be used for association analysis. This model makes the tenuous assumption that for a given disease, the beginnings and ends of the copy number variation will be similar for subsets of the cases. That is, we expect the regions to be conserved for enough cases that we would have power to find a statistical association. If this assumption holds true, we can potentially find very small regions of variation because the signal will be assessed over multiple samples.
In reality, there may not always be consistent copy number segments across multiple samples. The univariate method segments each sample separately, finding the optimal cut-points for each sample. It then outputs a spreadsheet containing every unique segment found across all samples, with the mean of each of those segments for every sample. This output is in a format optimal for further association analysis.
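The way per-sample cut-points combine into one table of unique segments can be illustrated with a minimal sketch. This is not CNAM's implementation: the function names, the sample names, and the convention that a cut-point is the index of the first marker of a new segment are all assumptions made for illustration.

```python
def merge_cut_points(per_sample_cuts):
    """Return the sorted union of the cut-points found in any sample."""
    return sorted(set().union(*per_sample_cuts.values()))


def segment_means(log2_ratios, cut_points):
    """Mean log2 ratio of each segment.

    Assumption: each cut-point is the index of the first marker of a new
    segment and lies strictly between 0 and len(log2_ratios).
    """
    bounds = [0] + list(cut_points) + [len(log2_ratios)]
    return [sum(log2_ratios[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


# Hypothetical data: 8 markers, 2 samples, cut-points found per sample.
log2 = {
    "sampleA": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "sampleB": [0.0, 0.0, 0.0, 0.0, -1.0, -1.0, -1.0, -1.0],
}
cuts = {"sampleA": [2, 6], "sampleB": [4]}

unique_cuts = merge_cut_points(cuts)
# One row per sample, one column per unique segment.
table = {name: segment_means(values, unique_cuts) for name, values in log2.items()}
```

Here the union of cut-points splits the markers into four segments, and each sample receives a mean for all four, including segments whose boundaries were found only in the other sample.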
25.3.3.2 Speed optimization
If None is selected, the entire chromosome is segmented at once. If Moving Window is selected, the segmenting is performed using a moving window. This latter approach can dramatically reduce the run time of the algorithm, because the run time for a given window is proportional to the square of the window size multiplied by the number of segments per window. For example, segmenting a chromosome with 50000 points all at once will take on the order of 50000^2 operations times the Max segments per window parameter.
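A rough comparison of the two settings can be worked out from the proportionality stated above. The sketch below is only a back-of-the-envelope estimate: it assumes non-overlapping windows, which may not match how CNAM actually advances its moving window, and `relative_cost` is a hypothetical helper rather than part of the tool.

```python
import math


def relative_cost(n_markers, max_segments, window_size=None):
    """Relative operation count: window_size**2 * max_segments per window."""
    if window_size is None or window_size >= n_markers:
        # No speed optimization: the whole chromosome is one window.
        return n_markers ** 2 * max_segments
    # Assumption: windows do not overlap, so ceil(n / w) windows are needed.
    n_windows = math.ceil(n_markers / window_size)
    return n_windows * window_size ** 2 * max_segments


full = relative_cost(50_000, 40)
windowed = relative_cost(50_000, 40, window_size=10_000)
speedup = full / windowed
```

Under these assumptions, a window of 10000 markers on a 50000-marker chromosome reduces the estimated work by a factor of five, and the saving grows with chromosome length.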
25.3.3.3 Moving window size (#Markers)
The number of consecutive markers that will be analyzed in the moving window. This option is only available if Speed optimization Moving Window is selected.
Experience has shown 10000 to be a conservative default, giving good results with acceptable computation times. Note, however, that there is a somewhat higher risk of false discoveries using a moving window approach as there is the potential for anomalies due to looking at a window of the data instead of all of it. The permutation testing approach does minimize this, however.
25.3.3.4 Max segments per window
Set this to be greater than or equal to the number of copy number variations expected in a given window. The number of significant segments will be constrained to be no more than this number in a given window.
Experience has shown that 40 is a reasonable setting for a window size of 10000, given marker densities of 2000-5000kb such as found with Affymetrix and Illumina platforms. If you make a larger window size, you should increase this value accordingly. If the run log (see 25.3.6) shows the number of segments found equaling this parameter, then we would recommend increasing it.
25.3.3.5 Min #Markers per segment
This constrains the algorithm to only find copy number segments with this minimum number of markers in each segment.
This parameter allows you to prevent finding segments based on short spans of noise. In general the permutation testing should prevent small spurious segments from showing up, so this parameter is set to a good default of 1.
25.3.3.6 Max pairwise segment p-value
The Max segments per window parameter sets an upper bound on the number of segments found in a window. However, the problem remains of determining the actual number of valid copy number segments in the data. The process used is as follows: once a set of k segments is found, each pair of adjacent segments is compared through a permutation testing procedure. If every pair is statistically significant according to the Max pairwise segment p-value, then the k-way split is retained. Otherwise, k is decreased by one at a time until every segment is significantly different from its neighbor or no segments are found, whichever comes first.
Smaller p-values decrease the false-discovery rate but also decrease sensitivity. Conversely, larger p-values increase sensitivity but increase the false-discovery rate.
The number of permutations performed is 10 divided by this parameter, so 1000 permutations will be done for p = 0.01 and 200 for p = 0.05. The default of 0.01 is a good starting point, corresponding to a 1 percent chance of a false positive. Testing on simulated data with known answers has shown that 0.01 generally gives the best combination of sensitivity and false-discovery rate.
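The relationship between the p-value threshold and the number of permutations, and the general idea of comparing two adjacent segments by permutation, can be sketched as follows. The permutation test shown is a generic difference-in-means test chosen for illustration; the statistic CNAM actually uses is described in 25.10 and may differ. The function names and the fixed random seed are assumptions.

```python
import random


def n_permutations(max_p_value):
    """Number of permutations: 10 divided by the p-value threshold."""
    return round(10 / max_p_value)


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(left, right, n_perm, seed=0):
    """Generic two-sided permutation p-value for a difference in segment means.

    Illustrative only: marker values from both segments are pooled and
    shuffled, and we count how often the shuffled split is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(left) - mean(right))
    pooled = list(left) + list(right)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(left)], pooled[len(left):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits / n_perm


threshold = 0.01
n_perm = n_permutations(threshold)
# Hypothetical adjacent segments with clearly different log2 ratio levels.
p_distinct = permutation_p_value([0.0] * 20, [1.0] * 20, n_perm)
# Two segments with identical values should never look significant.
p_same = permutation_p_value([0.5] * 10, [0.5] * 10, n_perm)
keep_split = p_distinct <= threshold
```

A smaller threshold requires more permutations to resolve p-values at that level, which is why the permutation count scales inversely with the parameter.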
25.3.3.7 Number of Threads
Both the Univariate and Multivariate algorithms can take advantage of multi-processor or multi-core machines by performing some of their work in parallel threads. It is usually a good idea to match this number to the number of computational cores you have.
25.3.4 Optional File Output
Check this box to export the segment means to a UCSC Wiggle Track (WIG) format file for Genome Browser import. Use the Browse button for file name selection.
If outputting WIG files while using the Univariate segmenting algorithm, the Browse button will instead have you select a directory, since a separate WIG file will be generated for each sample. These files will be named using the sample name from the file.
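For readers unfamiliar with the format, the sketch below shows one way segment means could be laid out as a UCSC Wiggle track, using a `variableStep` declaration with a `span` for each segment. The exact layout of the files CNAM writes is not specified here, so treat this only as an illustration of the format; the function names, track name, and coordinates are hypothetical.

```python
def wig_lines(track_name, segments):
    """Build WIG lines for segments given as (chrom, start, end, mean).

    Assumption: start and end are 1-based, inclusive base positions.
    Each segment gets its own variableStep declaration because the
    span (segment length) differs from segment to segment.
    """
    lines = [f'track type=wiggle_0 name="{track_name}"']
    for chrom, start, end, seg_mean in segments:
        lines.append(f"variableStep chrom={chrom} span={end - start + 1}")
        lines.append(f"{start} {seg_mean:.4f}")
    return lines


def write_wig(path, track_name, segments):
    """Write the track to a file, one line per entry."""
    with open(path, "w") as handle:
        handle.write("\n".join(wig_lines(track_name, segments)) + "\n")


# Hypothetical segment means for one sample.
example = wig_lines(
    "sampleA",
    [("chr1", 1001, 2000, 0.42), ("chr1", 2001, 5000, -0.75)],
)
```

With the Univariate algorithm, a function like `write_wig` would be called once per sample, which corresponds to the one-file-per-sample behavior described above.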
25.3.5 Exclude Markers
If desired, use the Browse button to open a CSV or text file containing markers to exclude from the segmenting algorithm and its results. Use the viewer to preview the markers that will be excluded. To remove the list of markers, use the Clear List button.
25.3.6 Run Log
A log is shown in this window informing you of the tool’s progress in segmenting sub-regions of the region of markers being analyzed. It will show how many segments have been found in each sub-region which has been analyzed, as soon as the tool has finished with that region.
If the number of segments found in a given window is equal to Max segments per window, a warning message will be printed in red, suggesting that the user consider increasing that parameter.
NOTE: During processing, a normal progress bar is also shown in a separate window.