Step 6. Discretize Copy Number Segment Covariates
As an additional step, you may want to discretize the copy number segment covariates identified in Step 5 as two-state (0,1) or three-state (-1,0,1) covariates. Discretizing the covariates has the following benefits:
- Approximate copy number calls (potential deletion, neutral, or duplication; or, for two-state models, deletion versus not a deletion, or duplication versus not a duplication) can be made based on thresholds that signify a transition between copy number states.
- Interactive tree analysis can be performed (Step 7) with up to three way splits to find the most significant segments based on case/control status and a fixed threshold.
- P-value plots based on logistic regression (LogR covariates as independent, case/control status as dependent) can magnify small statistically significant differences between cases and controls.
- With discretized values, outliers (extremely small or large LogR values) have no more influence on the p-value than values close to the threshold.
- Sometimes it is not possible to join the phenotype spreadsheet with the Copy Number Segment Covariates spreadsheet (Step 5) because the spreadsheet is too large. Discretizing will allow association tests to be performed.
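The point about outliers can be illustrated with a small sketch. The LogR values below are made up for illustration, and the -0.07 threshold and the strict "less than" comparison are assumptions rather than the script's documented behavior. Replacing one deletion-range value with an extreme outlier moves the mean of the continuous values a long way, while the discretized calls stay exactly the same.

```python
from statistics import mean

def deletion_calls(values, threshold):
    """Two-state call: 1 if the value is below the threshold, else 0."""
    return [1 if v < threshold else 0 for v in values]

# Made-up LogR values for one segment across five samples.
log_r = [-0.10, -0.09, 0.01, 0.02, 0.00]
# Same samples, but the first value is now an extreme outlier.
log_r_outlier = [-4.0, -0.09, 0.01, 0.02, 0.00]

threshold = -0.07  # assumed deletion threshold

mean_shift = abs(mean(log_r) - mean(log_r_outlier))
calls = deletion_calls(log_r, threshold)
calls_outlier = deletion_calls(log_r_outlier, threshold)

print("shift in mean LogR:", round(mean_shift, 3))
print("calls unchanged:", calls == calls_outlier)
```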
This step leads you through discretizing the segment mean covariates as both two and three state covariates.
To complete this step you first need to download the Discretize CN Segment Covariates script from our Add-on Scripts Repository.
Save this script in the following directory:
*../HelixTree/scriptsHT/user/Spreadsheet/Scripts/
Figure 1a: The histogram at this view is hard to see because of a few outliers with large negative LogR values.
Next, you need to determine the appropriate threshold tolerance levels for copy number states by viewing a histogram of the segment means.
To do this, open the Segment Means spreadsheet created in Step 5. Left-click on the Segment Mean column header to turn the column magenta and then select >Analysis >Interactive Tree Analysis. A window with a single box (referred to as the root node) will appear. Click on this box and choose >Visualize Split Data >Histogram. A histogram with three "hills", similar to Figure 1a, will be displayed.
To determine the appropriate tolerance threshold(s) you need to identify, at least approximately, the lowest point of each "valley" between the hills. In Figure 1a the hills and valleys are barely distinguishable because a few outliers with large negative LogR values stretch the plot to the left. To get a better view, zoom in by right-clicking and dragging under the x-axis around the center of the histogram (Figure 1b).
Zoom in until your plot looks like Figure 1c. Notice that the Number of Bins (bottom right of the plot) is set to 1024.
From this view you can better distinguish the hills and valleys. Appropriate tolerance level thresholds in this instance would be around -0.07 for deletions and +0.07 for duplications. Anything in between would be copy number neutral.
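The same idea can be sketched outside the software. The example below is only an illustration, not what HelixTree does internally: it simulates segment means as three hills plus a few extreme outliers, restricts the histogram to a central range (the equivalent of zooming in), smooths the counts, and reports the lowest bin inside a search window picked by eye on each side of zero. The simulated data, the bin count, the smoothing window, and all function names are assumptions, so the thresholds it prints will not match the ±0.07 read from the tutorial data.

```python
import random

def histogram(values, lo, hi, n_bins):
    """Count values in [lo, hi) using n_bins equal-width bins (the 'zoom')."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts

def smooth(counts, half_window=2):
    """Moving average to reduce bin-to-bin noise before looking for valleys."""
    out = []
    for i in range(len(counts)):
        window = counts[max(0, i - half_window):i + half_window + 1]
        out.append(sum(window) / len(window))
    return out

def valley(counts, lo, hi, left, right):
    """Centre of the lowest bin whose centre lies between left and right."""
    width = (hi - lo) / len(counts)
    centres = [lo + (i + 0.5) * width for i in range(len(counts))]
    candidates = [(c, x) for c, x in zip(counts, centres) if left <= x <= right]
    return min(candidates)[1]

# Simulated segment means: deletion, neutral, and duplication hills,
# plus a few outliers with large negative values.
rng = random.Random(0)
segment_means = (
    [rng.gauss(-0.15, 0.03) for _ in range(1000)]
    + [rng.gauss(0.0, 0.03) for _ in range(8000)]
    + [rng.gauss(0.15, 0.03) for _ in range(1000)]
    + [-3.0, -2.5, -4.0]
)

# Zoom to [-0.3, 0.3) so the outliers no longer stretch the axis.
counts = smooth(histogram(segment_means, -0.3, 0.3, 60))
lower = valley(counts, -0.3, 0.3, -0.15, 0.0)  # valley left of the neutral hill
upper = valley(counts, -0.3, 0.3, 0.0, 0.15)   # valley right of the neutral hill
print("lower threshold ~", round(lower, 3), "upper threshold ~", round(upper, 3))
```

The search windows stand in for the judgment you make by eye in the zoomed plot; an approximate valley position is enough.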
Now that you have determined the appropriate thresholds, you can use them to discretize the copy number segment covariates (continuous LogR values) as two-state (0,1) or three-state (-1,0,1) covariates.
Three-State Discretization
Open the Copy Number Segment Covariates spreadsheet created in Step 5. From this spreadsheet, select >Scripts >Discretize CN Segment Covariates. You will be prompted to select the copy number model. First select Three-State Model and click OK. On the next window, for the Lower Tolerance Level enter the smaller of the two threshold values determined above and for the Upper Tolerance Level enter the larger of the two threshold values determined above. In this example, you would enter -0.07 and 0.07 respectively. Click OK.
The result will be a new spreadsheet with the same copy number segments, but instead of containing continuous LogR values, it will contain -1s for potential deletions, 0s for potential neutrals, or 1s for potential duplications.
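As a rough sketch of the mapping just described (not the script's actual code), the function below assigns -1, 0, or 1 to each LogR value. The function name is hypothetical, and treating values exactly equal to a tolerance level as neutral is an assumption; the script may handle the boundary differently.

```python
def discretize_three_state(log_r, lower, upper):
    """Map a continuous LogR value to -1 (deletion), 0 (neutral), 1 (duplication).

    Assumption: values exactly equal to a tolerance level count as neutral.
    """
    if lower > upper:
        raise ValueError("lower tolerance level must not exceed upper")
    if log_r < lower:
        return -1
    if log_r > upper:
        return 1
    return 0

# One made-up row of segment covariates for a single sample.
row = [-0.42, -0.07, 0.01, 0.07, 0.35]
three_state_row = [discretize_three_state(v, -0.07, 0.07) for v in row]
print(three_state_row)  # [-1, 0, 0, 0, 1]
```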
Two-State Discretization
You can create two-state discretized covariates from either the original Copy Number Segment Covariates spreadsheet or from the Three State Covariates spreadsheet.
From the original Copy Number Segment Covariates spreadsheet select >Scripts >Discretize CN Segment Covariates. Choose one of the two Two State models and click OK. This time you only need to enter one tolerance level. For deletions, enter the smaller of the two thresholds determined above; for duplications, enter the larger of the two. Click OK.
The result will be a new spreadsheet containing all 0s and 1s. For the Two State -- Deletions model a 1 represents a deletion and a 0 represents not a deletion. For the Two State -- Duplications model a 1 represents a duplication and a 0 not a duplication.
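The two models can be sketched as a pair of one-line functions. These are illustrative only: the function names are hypothetical, and the strict comparisons (a value equal to the tolerance level is not called) are an assumption about boundary handling.

```python
def two_state_deletions(log_r, threshold):
    """1 = deletion, 0 = not a deletion (assumed strict comparison)."""
    return 1 if log_r < threshold else 0

def two_state_duplications(log_r, threshold):
    """1 = duplication, 0 = not a duplication (assumed strict comparison)."""
    return 1 if log_r > threshold else 0

# One made-up row of segment covariates for a single sample.
row = [-0.42, -0.07, 0.01, 0.07, 0.35]
deletions = [two_state_deletions(v, -0.07) for v in row]
duplications = [two_state_duplications(v, 0.07) for v in row]
print(deletions)     # [1, 0, 0, 0, 0]
print(duplications)  # [0, 0, 0, 0, 1]
```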
Alternatively, you can run the Discretize CN Segment Covariates script from the Three State Covariates spreadsheet. Instead of entering the threshold values determined above, enter 0 as the tolerance level for both the Two State -- Deletions and Two State -- Duplications models.
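A short check shows why a tolerance level of 0 works on three-state values: every -1 lies below 0 and every 1 lies above it, so the two routes give the same calls. This is a sketch with hypothetical function names and made-up values, and it assumes strict comparisons at the tolerance levels.

```python
def three_state(log_r, lower, upper):
    """-1 below lower, 1 above upper, otherwise 0 (assumed strict comparisons)."""
    if log_r < lower:
        return -1
    if log_r > upper:
        return 1
    return 0

def two_state_deletions(value, threshold):
    return 1 if value < threshold else 0

def two_state_duplications(value, threshold):
    return 1 if value > threshold else 0

log_r_values = [-0.42, -0.07, 0.01, 0.07, 0.35]  # made-up LogR values
lower, upper = -0.07, 0.07
states = [three_state(v, lower, upper) for v in log_r_values]

# Route 1: discretize the continuous values with the original thresholds.
del_direct = [two_state_deletions(v, lower) for v in log_r_values]
dup_direct = [two_state_duplications(v, upper) for v in log_r_values]

# Route 2: discretize the three-state values with a tolerance level of 0.
del_via_states = [two_state_deletions(s, 0) for s in states]
dup_via_states = [two_state_duplications(s, 0) for s in states]

print(del_direct == del_via_states, dup_direct == dup_via_states)  # True True
```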