Step 2. Identify Markers to Exclude
It is sometimes desirable to filter out problematic markers before running segmentation, such as markers with low call rates or markers that are associated with gender as an artifact of a poorly randomized experiment. This tutorial focuses on identifying gender-associated markers and creating a list of them to exclude when performing LogR association tests and copy number segmenting (Step 4 and Step 5). Other quality control measures can be applied by following a similar workflow.
To find gender-associated markers you need to perform an association test with gender as the dependent variable and LogRs as the independent variables. This can be done rather easily using the LogR Association Tests and PCA window in CNAM. From the Project Navigator window select >CNAM >LogR Association Tests and PCA.
Figure 1: LogR Association Tests and PCA window.
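To make the idea of a gender association test concrete, the following is a minimal sketch of a per-marker comparison of LogR values between males and females. It is not CNAM's implementation: the Welch t statistic, the normal approximation used for the p-value, the marker and sample names, and the "M"/"F" coding are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (needs >= 2 values each)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def gender_association(logr, gender):
    """Return {marker: (t, p)} comparing LogR between males and females.

    The p-value uses a normal approximation to the t distribution, which is
    only reasonable for large samples; it is an assumption of this sketch.
    """
    results = {}
    for marker, values in logr.items():
        males = [v for sample, v in values.items() if gender.get(sample) == "M"]
        females = [v for sample, v in values.items() if gender.get(sample) == "F"]
        t = welch_t(males, females)
        p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
        results[marker] = (t, p)
    return results


# Hypothetical data: one marker that differs by gender, one that does not.
gender = {"s1": "F", "s2": "F", "s3": "F", "s4": "M", "s5": "M", "s6": "M"}
logr = {
    "marker_x": {"s1": 0.40, "s2": 0.35, "s3": 0.45, "s4": -0.40, "s5": -0.35, "s6": -0.45},
    "marker_a": {"s1": 0.02, "s2": -0.03, "s3": 0.01, "s4": 0.01, "s5": -0.02, "s6": 0.02},
}
results = gender_association(logr, gender)
for marker, (t, p) in results.items():
    print(marker, round(t, 2), p)
```

In this toy data set only marker_x would fall below a 0.01 threshold, which is the kind of marker this step is meant to flag.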
From this window, Browse to the LogR DSF File created in Step 1. You can then choose which Chromosomes you want to perform association tests on. In this case, select all chromosomes – this will be the default if your LogR DSF contains information for all chromosomes. Next, Check the association tests you would like to perform.
Next, you need to select a spreadsheet within HelixTree containing your phenotype information, gender in this case. If you have not already imported your phenotype spreadsheet, from the Project Navigator window go to >File >Import Data and choose either >Import Wizard or >Import ASCII File. Make sure to indicate the column with your sample names as the row label column. This will ensure each sample’s gender status is appropriately matched to its LogRs when performing the association test.
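The reason the row label column matters can be sketched as a lookup by sample name. The example below is illustrative only and does not reflect how HelixTree stores spreadsheets; the field names "sample" and "gender" and the sample IDs are hypothetical.

```python
def match_samples(phenotype_rows, logr_samples):
    """Pair each LogR sample with its gender by sample name.

    Returns (matched, missing): a dict of sample -> gender, and a list of
    LogR samples that have no row in the phenotype table.
    """
    gender_by_sample = {row["sample"]: row["gender"] for row in phenotype_rows}
    matched = {s: gender_by_sample[s] for s in logr_samples if s in gender_by_sample}
    missing = [s for s in logr_samples if s not in gender_by_sample]
    return matched, missing


# Hypothetical phenotype table and LogR sample list.
phenotype_rows = [
    {"sample": "s1", "gender": "F"},
    {"sample": "s2", "gender": "M"},
]
matched, missing = match_samples(phenotype_rows, ["s2", "s1", "s3"])
print(matched)
print(missing)
```

A sample that appears in the LogR data but not in the phenotype spreadsheet (s3 here) cannot contribute to the test, so it is worth checking for such mismatches before running the analysis.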
Once your phenotype spreadsheet is imported into HelixTree, from the LogR Association Tests and PCA window click Choose Spreadsheet, select the appropriate phenotype spreadsheet and click OK.
Once a phenotype spreadsheet is selected, the Quantitative or Binary Trait box is activated. Scroll down and select the gender column to make it the dependent variable.
Now you can perform association test(s) on gender. You may also simultaneously correct for batch effects or stratification using principal component analysis (PCA); this is covered in Step 3. For now, Uncheck the Correct for Batch Effects/ Stratification with PCA box.
Click Run.
The result is a spreadsheet with markers as rows and p-values for each test statistic as columns (Figure 2).
NOTE: The p-values in this spreadsheet are not corrected for multiple testing.
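To illustrate what correcting for multiple testing would involve, here is a minimal sketch of a Bonferroni adjustment, in which each p-value is multiplied by the number of tests and capped at 1. This is shown only for context: the workflow in this tutorial filters on the uncorrected p-values, and the marker names and values below are hypothetical.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: min(1, p * number_of_tests)."""
    n_tests = len(p_values)
    return {marker: min(1.0, p * n_tests) for marker, p in p_values.items()}


# Hypothetical uncorrected p-values for four markers.
raw = {"m1": 0.0001, "m2": 0.004, "m3": 0.2, "m4": 0.6}
adjusted = bonferroni(raw)
print(adjusted)
```

With many thousands of markers the adjustment becomes very strict, which is one reason an uncorrected threshold may be preferred when the goal is to exclude suspect markers rather than to report associations.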
Figure 2: P-value spreadsheet after performing LogR association tests.
Figure 3: Gender-associated markers.
You can plot each column to visually inspect the markers associated with gender by Left Clicking on the column number and selecting Plot this Column. You should see a plot similar to Figure 3 with significant peaks across the genome, signifying gender-associated markers.
Now you need to create a row subset spreadsheet of only those markers whose p-values are less than a given threshold (e.g. <0.01).
To do this, first go back to the p-value spreadsheet, Left Click on the column number again and this time select Sort Ascending. Scroll down to the first marker that exceeds your p-value threshold and Left Click that marker’s row label once. This will turn the entire row grey, indicating that it is inactive. To inactivate the rest of the markers above the threshold, hold down the Shift key, scroll to the bottom of the spreadsheet and Left Click the last marker’s row label. All markers with p-values greater than the threshold should now be inactive, and all markers with p-values less than the threshold should remain active (Figure 4).
Figure 4: Markers filtered at a p-value threshold of .01.
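The sort-and-inactivate procedure above amounts to keeping the markers whose p-values fall below the threshold. A minimal sketch of the same selection outside the GUI, assuming the p-values are available as a dictionary keyed by marker name (the names and values are hypothetical):

```python
def markers_below_threshold(p_values, threshold=0.01):
    """Return marker names with p < threshold, sorted by ascending p-value."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    return [marker for marker, p in ranked if p < threshold]


# Hypothetical p-values; m4 sits exactly on the threshold and is not selected.
p_values = {"m1": 0.2, "m2": 0.0005, "m3": 0.009, "m4": 0.01}
selected = markers_below_threshold(p_values)
print(selected)  # ['m2', 'm3']
```

The comparison is strict, matching the "less than" wording of the threshold; change it to `<=` if markers exactly at the cutoff should also be excluded.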
Next, from the spreadsheet menu, select >Edit >Row >Subset Spreadsheet. This will create a new spreadsheet containing only the markers below your p-value threshold (that is, those remaining active in the original spreadsheet). Next, export this spreadsheet as a CSV file by selecting >File >Save a Comma-Delimited Text File. Browse to a path where you want the CSV file saved and click Save.
Finally, you need to delete all the columns in the CSV file except for the first, containing the marker names. This can be done easily within Excel or another spreadsheet editing program. Once you have a CSV file with a single column of marker names, it can be used in Step 4 and Step 5 to exclude markers when performing LogR association tests and running the segmenting algorithm.
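If you prefer a script to a spreadsheet editor, the column removal can be sketched with Python's standard csv module. This is one possible approach under stated assumptions: the file names are hypothetical, and every row (including any header row) is kept, since the exact layout expected in Step 4 and Step 5 is not described here.

```python
import csv


def keep_first_column(in_path, out_path):
    """Copy only the first column of a CSV file to a new CSV file."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row:  # skip completely empty lines
                writer.writerow([row[0]])


if __name__ == "__main__":
    # Hypothetical file names; replace with the path you exported to.
    keep_first_column("gender_markers_pvalues.csv", "gender_markers_exclude.csv")
```

Open the output file once to confirm that it contains a single column of marker names before using it as an exclusion list.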
You can now proceed to Step 3: Correct for batch effects/stratification, or skip to Step 4: Perform whole genome LogR association tests or Step 5: Run segmenting algorithm.