Created:
March 27, 2008

Updated:
July 16, 2008

User Level:
Intermediate

Products:
HelixTree, CNAM, WGA Module


Step 4. Perform Whole Genome Log Ratio Association Tests

It is advantageous to perform association tests directly on LogRs before running copy number segmentation (Step 5) because the computation is much faster and it provides a good first look at the data. Note, however, that the number of LogRs will be much greater than the number of copy number segment covariates, resulting in a greater multiple testing penalty.
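To make the multiple testing penalty concrete, the sketch below compares per-test significance thresholds under a Bonferroni correction. Bonferroni is used here only as a familiar illustration; this tutorial does not say which correction to apply, and the marker and segment counts are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# Hypothetical counts, chosen only to show the scale of the difference.
n_logr_markers = 500_000   # one test per LogR marker
n_segments = 5_000         # one test per copy number segment covariate

print(bonferroni_threshold(0.05, n_logr_markers))
print(bonferroni_threshold(0.05, n_segments))
```

With these assumed counts, each LogR test must clear a threshold 100 times stricter than each segment test.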

Figure 1. LogR Association Tests and PCA Window.

You can perform LogR association tests on any LogR DSF file. We recommend first correcting for batch effects and stratification and filtering out problematic markers as covered in Steps 2 and 3.

NOTE: LogR association tests can be performed at the same time as correcting for batch effects and stratification and excluding problematic markers, all from the same window. This tutorial highlights each process separately.

To get started, from the Project Navigator window select >CNAM >LogR Association Tests and PCA. In this window (Figure 1), Browse to the PCA corrected LogR DSF file created in Step 3 and select the Chromosomes you want to test. Check the box next to the association test(s) you want to run.

Next, select a spreadsheet within HelixTree containing your phenotype information, in this case either case-control status or a quantitative trait. Your phenotype spreadsheet should already have been imported during Step 2. If it has not, from the Project Navigator window go to >File >Import Data and choose either >Import Wizard or >Import ASCII File. Make sure to indicate the column with your sample names as the row label column. This ensures that each sample’s phenotype status is correctly matched to its LogRs when performing the association test(s).
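Because the match depends on sample names, it can help to check the names outside HelixTree before importing. The following is a minimal sketch that assumes names must match exactly, including case (an assumption, not documented HelixTree behavior); the sample names and function name are hypothetical.

```python
def match_samples(logr_samples, phenotypes):
    """Split LogR sample names into those with and without a phenotype row."""
    matched = {name: phenotypes[name] for name in logr_samples if name in phenotypes}
    missing = [name for name in logr_samples if name not in phenotypes]
    return matched, missing

# Hypothetical sample names; note the lowercase "s003" in the phenotype labels.
logr_samples = ["S001", "S002", "S003"]
phenotypes = {"S001": 1, "S002": 0, "s003": 1}

matched, missing = match_samples(logr_samples, phenotypes)
print("matched:", matched)
print("missing phenotype:", missing)
```

Any name reported as missing should be corrected in the phenotype file before running the association tests.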

Once your phenotype spreadsheet is imported into HelixTree, go to the LogR Association Tests and PCA window, click Choose Spreadsheet, select the appropriate phenotype spreadsheet and click OK. A list of all quantitative and binary variables in the phenotype spreadsheet will appear in the next box. Select the variable you want as your dependent variable.

Because the DSF file selected above already contains PCA corrected LogRs, uncheck the Correct Batch Effects/Stratification with PCA box. This deactivates all PCA related options, including the parameters on the next tab.

If you wish to exclude markers in addition to those already excluded in Step 3, go to the Exclude Markers tab, Browse to the CSV file listing the additional markers and click Open.

You are now ready to perform association. Click Run.

The result will be a p-value spreadsheet (not shown) with all the marker names in the first column and their corresponding p-values in each additional column.

You can plot each column by Left Clicking on the column number and selecting Plot this Column.

Figure 2. P-value plot from t-tests on PCA corrected LogRs.

In some cases, there may still be spurious associations across the genome (as seen in Figure 2). To reduce some of this noise, you can apply a Median Smooth script to the p-values in each column. The Median Smooth script calculates a moving median centered on a given observation, plus or minus a user-defined window size. This script is not provided with the software but can be downloaded from our Add-on Script Repository.

To use the Median Smooth script, download the mediansmooth.py file from the webpage above and save it in your ../ HelixTree/scriptsht/spreadsheet/edit/ directory. Next, from the p-value spreadsheet select >Edit >Median Smooth. A dialogue window will appear asking for a moving window size (Figure 3). Enter an appropriate window size (e.g. 3) and click OK.

Figure 3. Median Smooth script with a 3 marker window size.

NOTE: A window size of 3 includes 7 observations: the marker itself plus 3 markers on each side.

The result is a new p-value spreadsheet with the original p-value columns and new median smoothed p-value columns. You can again plot each column by Left Clicking on the column header and selecting Plot this Column.
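The moving median described above can be sketched in a few lines of Python. This is an illustration of the technique only, not the contents of mediansmooth.py; in particular, shrinking the window at the two ends of the column is an assumption, and the p-values are made up.

```python
from statistics import median

def median_smooth(values, window):
    """Moving median centered on each observation, plus or minus `window` neighbors.

    Assumption: near the ends of the list the window is truncated to the
    observations that exist, rather than padded.
    """
    smoothed = []
    n = len(values)
    for i in range(n):
        lo = max(0, i - window)          # first index inside the window
        hi = min(n, i + window + 1)      # one past the last index inside the window
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Hypothetical p-values: one isolated spike (0.001) surrounded by noise.
p_values = [0.80, 0.60, 0.001, 0.70, 0.90]
print(median_smooth(p_values, window=1))
```

With a window of 1, the isolated 0.001 is replaced by the median of itself and its two neighbors, which is why single-marker spikes tend to disappear while runs of several significant markers survive.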

You should see far fewer spurious associations in a smoothed plot (Figure 4) versus an unsmoothed plot (Figure 2). By Left Clicking and Dragging on the plot you can zoom into a region. Zooming in on a significant spike (Figure 5) should reveal a region spanning several markers with multiple significant p-values.

Figure 4. Plot of median smoothed p-values.

Figure 5. Zoomed p-value plot of a significant peak spanning ~10 markers.

A quick way to determine where significant markers reside on the genome and what genes they are associated with is to join the p-value spreadsheet with a marker map containing these markers. Such a marker map was imported in Step 1 and should already be in the project.

To do this, go back to the median smoothed p-value spreadsheet and select >File >Join Spreadsheets on Row Labels. Select the appropriate marker map spreadsheet and click OK. The result is a spreadsheet with marker names, p-values and marker map information.
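Conceptually, joining on row labels pairs each marker's p-values with the map row that has the same marker name. The sketch below shows the idea with plain dictionaries; keeping only markers present in both tables is an assumption about the join behavior, and all marker names, positions and column names are hypothetical.

```python
def join_on_row_labels(left, right):
    """Combine the columns of two tables for every row label found in both."""
    joined = {}
    for label, left_cols in left.items():
        if label in right:
            joined[label] = {**left_cols, **right[label]}
    return joined

# Hypothetical smoothed p-values and marker map, keyed by marker name.
p_values = {
    "marker_A": {"smoothed_p": 0.0004},
    "marker_B": {"smoothed_p": 0.62},
    "marker_C": {"smoothed_p": 0.0009},
}
marker_map = {
    "marker_A": {"chromosome": "1", "position": 1050000},
    "marker_B": {"chromosome": "1", "position": 1062000},
}

joined = join_on_row_labels(p_values, marker_map)
for name, row in joined.items():
    print(name, row)
```

In this example marker_C is dropped because it has no entry in the marker map.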

From here you can go to Step 5 to run the Golden Helix segmenting algorithm, either on only those chromosomes showing significant LogR association results or on the entire PCA corrected LogR DSF file.