" 'Where is the missing heritability?' is a question asked frequently in genetic research. The difficulty seems to come down to the common disease/common variant hypothesis not holding up." » Read more
Updated: March 8, 2010
Level: Intermediate
Modules: HelixTree, CNAM, Regression
This tutorial covers a comprehensive set of CNV analysis workflows in SVS 7 focusing on methods for processing raw intensity data, performing quality assurance, identifying regions of copy number variation (CNV), visualizing copy number data, and performing association analysis on a variety of copy number covariates.
Outlined below is a procedure for discretizing the logR copy number segment covariates as two-state (0,1) or three-state (-1,0,1) covariates based on defined thresholds. Discretizing the covariates has the following benefits:
To complete this step you first need to download the following scripts:
Create Spreadsheet for Segmentation.py
Discretize CN Segment Covariates.py
Note: The Application Data folder is a hidden folder on Windows operating systems and its location varies between XP and Vista. The easiest way to locate this directory on your computer is to click on the AppData shortcut in your C:\Program Files\Golden Helix SVS\ directory or wherever you installed Golden Helix SVS.
If saved to the proper folder, these scripts should show up in the Scripts menu from any spreadsheet.
In addition to detecting copy number segments, CNAM optimal segmenting can be used to define thresholds between the three copy number states: copy number loss, copy number neutral and copy number gain. First, create a histogram of segment means for an understanding of what the segmenting algorithm is doing.
A histogram is displayed with an apparent normal distribution. With copy number data you expect to see at least three "hills" or distributions of logRs, representing copy number loss (left hill), copy number neutral (middle hill), and copy number gain (right hill). This becomes visible when you change the Bin Size on the histogram. To do this:
The histogram should now match that in Figure 13 with the three hills.
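The effect of the bin setting can be illustrated outside SVS. The following is a minimal sketch using NumPy on synthetic data; the function name `histogram_counts`, the simulated values, and the bin counts are assumptions made for illustration and are not taken from the tutorial data set.

```python
import numpy as np

def histogram_counts(segment_means, bin_count):
    """Return (counts, edges) for a histogram of segment means."""
    values = np.asarray(segment_means, dtype=float)
    counts, edges = np.histogram(values, bins=bin_count)
    return counts, edges

# Synthetic logR segment means (illustrative only): a small "loss" hill,
# a large "neutral" hill, and a small "gain" hill.
rng = np.random.default_rng(0)
synthetic = np.concatenate([
    rng.normal(-0.35, 0.05, 200),   # hypothetical copy number loss
    rng.normal(0.00, 0.05, 1000),   # hypothetical copy number neutral
    rng.normal(0.30, 0.05, 200),    # hypothetical copy number gain
])

# Few, wide bins blur the hills together; many narrow bins separate them.
coarse_counts, coarse_edges = histogram_counts(synthetic, 5)
fine_counts, fine_edges = histogram_counts(synthetic, 60)

print("coarse:", coarse_counts)
print("fine:  ", fine_counts)
```

With only a handful of bins the side hills tend to merge into the shoulders of the central one, which is why the default view can look like a single normal distribution.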
IMPORTANT: If you observe a unimodal distribution (only one hill) regardless of Bin Count, or a large number of outliers in your data, discretizing is not recommended. In this case, use the PCA-corrected segments as covariates for association testing instead.
The objective now is to determine the boundaries between the three hills. This is difficult to do visually because the three distributions overlap. Rather than approximating these boundaries by eye, you can use CNAM optimal segmenting, which offers a more exact approach: it places the boundaries so as to minimize the sum of squared errors about the mean of each distribution.
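To make the least-squares idea concrete, here is a minimal sketch of splitting sorted values into three groups so that the total squared deviation from each group's mean is as small as possible. It is an illustration of the criterion only, not CNAM's implementation; the function name `best_three_way_split` and the example values are hypothetical.

```python
import numpy as np

def best_three_way_split(sorted_values):
    """Return (i, j) such that x[:i], x[i:j], x[j:] minimize total within-group SSE.

    Exhaustive search over both cut points; prefix sums make each
    group's sum of squared errors an O(1) computation.
    """
    x = np.asarray(sorted_values, dtype=float)
    n = len(x)
    if n < 3:
        raise ValueError("need at least three values")

    # Prefix sums of x and x^2, with a leading zero for easy slicing.
    s1 = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x * x)])

    def sse(a, b):
        # Sum of squared deviations of x[a:b] from its own mean.
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best_cost, best_i, best_j = np.inf, None, None
    for i in range(1, n - 1):
        left = sse(0, i)
        for j in range(i + 1, n):
            cost = left + sse(i, j) + sse(j, n)
            if cost < best_cost:
                best_cost, best_i, best_j = cost, i, j
    return best_i, best_j

# Hypothetical sorted segment means with three well-separated groups.
example = [-0.40, -0.38, -0.36, -0.01, 0.00, 0.01, 0.02, 0.30, 0.32]
print(best_three_way_split(example))  # → (3, 7)
```

Because the criterion depends on group means, a few extreme values can pull a boundary away from where it appears to sit visually.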
You will be prompted to select a column number to index. Choose the Segment Means column (column #4).
This will create a new spreadsheet in the Project Navigator called Index Segment Mean - Mapped Sheet 1, with 1 row and 2,437 columns. The smallest value should be at the beginning of the spreadsheet and the largest value at the end of the spreadsheet.
Two new spreadsheets and a run log appear. The Segment List spreadsheet is all you need for this step.
Notice in the Segment List spreadsheet (Figure 15) that CNAM found three segments as indicated by the number of rows in the spreadsheet.
The End Index column denotes the boundaries between the three distributions. To obtain the threshold values, locate the corresponding columns in the Index Segment Mean - Mapped Sheet 1 spreadsheet.
This takes you to column 407, which has a value of -0.113927878439426 denoting the boundary between copy number loss and copy number neutral.
The value at this column is 0.0986386612057686 denoting the boundary between copy number neutral and copy number gain.
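The lookup described here amounts to sorting the segment means and reading off the values at the boundary positions. The sketch below assumes the End Index values are 1-based column positions in the sorted sheet; that convention, the function name `thresholds_from_end_indices`, and the sample numbers are assumptions for illustration.

```python
import numpy as np

def thresholds_from_end_indices(segment_means, end_indices, one_based=True):
    """Sort the segment means and return the values at the given boundary positions."""
    sorted_means = np.sort(np.asarray(segment_means, dtype=float))
    # Assumption: spreadsheet columns are counted from 1, Python arrays from 0.
    offset = 1 if one_based else 0
    return [float(sorted_means[k - offset]) for k in end_indices]

# Hypothetical, unsorted segment means and two boundary positions.
means = [0.3, -0.4, 0.0, -0.1, 0.1]
print(thresholds_from_end_indices(means, [2, 4]))  # → [-0.1, 0.1]
```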
NOTE: CNAM bases its calculations on the mean of each distribution and therefore its results can be affected by large outliers. This may result in discrepancies with visual inspection. What are considered outliers and how outliers are handled differs from study-to-study, array-to-array, and researcher-to-researcher. Even though visually there appear to be negative outliers in the left-most distribution, for the purpose of this tutorial they will not be treated as such and the calculated thresholds will be used for discretizing.
Now that appropriate thresholds have been determined, you can use these to discretize the copy number segment covariates (continuous logR values) as two-state (0,1) or three-state (-1,0,1) covariates. This tutorial will cover three-state discretization, though the same process can be applied to two-state discretization.
You will be prompted to choose a Three State Model or one of two Two State Models.
This will create a new spreadsheet, Three State Covariates, with -1s for potential losses, 0s for potential neutrals, and 1s for potential gains (Figure 16).
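The three-state coding can be sketched in a few lines. This is not the Discretize CN Segment Covariates.py script itself; the function name `discretize_three_state` is hypothetical, and treating values exactly equal to a threshold as a loss or gain is an assumption, since the tutorial does not say how ties are handled.

```python
import numpy as np

# Thresholds found earlier in this tutorial.
LOSS_THRESHOLD = -0.113927878439426
GAIN_THRESHOLD = 0.0986386612057686

def discretize_three_state(log_ratios, loss=LOSS_THRESHOLD, gain=GAIN_THRESHOLD):
    """Map continuous logR values to -1 (loss), 0 (neutral), or 1 (gain)."""
    x = np.asarray(log_ratios, dtype=float)
    states = np.zeros(x.shape, dtype=int)   # default: neutral
    states[x <= loss] = -1                  # assumption: boundary value counts as loss
    states[x >= gain] = 1                   # assumption: boundary value counts as gain
    # Note: missing values (NaN) fail both comparisons and stay 0 in this sketch.
    return states

print(discretize_three_state([-0.5, -0.05, 0.0, 0.2]))  # → [-1  0  0  1]
```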
You will need to reapply the marker map.
A final spreadsheet, Three State Covariates - Mapped Sheet 1, is created, which you can use to perform association tests on discretized copy number segments.
© 2010 Golden Helix, Inc. All Rights Reserved