1. Home
  2. SVS
  3. SVS Tutorials
  4. Advanced Tutorials
  5. SVS Rare Variant Analysis Tutorial

SVS Rare Variant Analysis Tutorial

Welcome to the Rare Variant Analysis of Complex Diseases Tutorial

_images/lower-risk-kbac-cmc-comparison-plot.png

Updated: January 28, 2021

Level: Advanced

Packages: DNA-Seq, Power Seat

This tutorial covers a complex case/control variant analysis workflow. The steps include variant collapsing and association testing using sequence data and a simulated phenotype.

Requirements

To complete this tutorial you will need to download and unzip the following file, which includes a starter project.

SVS_Rare_Variant_Tutorial.zip

Files included in the above ZIP file:

  • Variant Analysis Tutorial – Starter project containing variant data.

1. Overview

The technology for high-throughput production of gene and protein sequence data is rapidly improving, and the applications of sequence technology are also developing rapidly. In general, DNA sequence analysis is tasked with studying the effects of rare and low-frequency variants, as well as potentially common variants on the phenotype of interest. The bioinformatics of sequence analysis ranges from instrument-specific processing of raw data to the final aggregation of multiple samples into data mining and analysis tools. The software of sequence analysis can be categorized into the three stages of the data’s lifecycle: primary, secondary, and tertiary analysis.

Primary analysis can be defined as the machine-specific steps needed to call individual bases and compute quality scores for those calls. Because current sequencing technologies are generally based on the “shotgun” approach of chopping all the DNA up into smaller molecules and then generating what are referred to as “reads” of these small nucleotide sequences, it’s left up to secondary analysis to reassemble these reads to get a representation of the underlying long-range sequence. Secondary analysis is the process of assembling the individual reads or aligning them to a reference genome and detecting variants. While more customizable, and sometimes considered part of tertiary analysis, variant calling lends itself to being pipelined in the same manner as alignment. Out of the secondary analysis step of variant calling, you now have a more manageable set of differences between the sequenced samples and the reference, but there is still an enormous amount of data to make sense of. This is the realm of tertiary analysis.

Association techniques used in GWAS studies do not have the power to detect the significance of rare variants individually or provide tools for measuring their compound effect, referred to as rare variant burden. To do this, it is necessary to collapse several variants into a single covariate based on regions such as genes. This tutorial will cover methods based on detecting the presence of or counting the variants in a gene, the Combined Multivariate and Collapsing (CMC) method, the Kernel-Based Adaptive Collapsing (KBAC) method, and the Optimized Sequence Kernel Association Test (SKAT-O).

Exactly When is Rare Variant Analysis Appropriate?

The quick answer to this question is, “Rare Variant Analysis is normally appropriate when next-generation sequencing (NGS) is used, and normally not needed when microarrays are used.”

That is because NGS sequencing can produce many variant calls where the minor allele frequency (MAF), or the variant allele frequency (VAF), is only 1%, or sometimes far less than 1%, and for which only one or two or three samples may contain the variant, while microarrays normally produce data that may be properly analyzed through GWAS methods. GWAS studies are best done where:

  • The minor allele frequency (MAF) (also sometimes called variant allele frequency or VAF) is at least 1% (.01), and preferably 5% (.05), and
  • The expected number of cases with at least one minor allele present (MAF or VAF times number of cases) and the expected number of controls with at least one minor allele present (MAF or VAF times number of controls) should both be at least 5, and preferably at least 10.

Otherwise, you should use rare variant analysis techniques.

Since NGS can find common variants as well as rare variants, you may have NGS data for which you would like to use normal GWAS techniques–just make sure your data fits the specifications above. In order to do this, you may wish to filter out the rare variants from your data while leaving in the common variants.

NOTE: The fact that the data used in this tutorial must be analyzed using rare variant techniques is illustrated by the (attempted) GWAS test performed later in this tutorial for demonstration purposes. (See Section 6 of this Tutorial, Compare Results with GWAS Approach, subheading Attempt an Association Test on Individual Rare Variants.)

Download Annotation Data Sources

Later in the tutorial, an annotation track will be used to create a variant frequency bin spreadsheet. You will need to download this track before proceeding.

NOTE: If you have completed the Intro to NGS Tutorial, you should have already downloaded this particular track (1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI). Otherwise:

  • Open the previously downloaded project.
  • From the SVS Project Navigator, choose Tools > Manage Data Sources.
  • Click Public Annotations.
  • Navigate to 1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI and check the box to the left of this track. See Figure 1-1.
Data Source Library
Figure 1-1: Data Source Library

  • Click Download.

After the download finishes, you may close the Data Source Library and Downloads windows and proceed with the tutorial.

Note

For this tutorial, you will also need the annotation track RefSeq Genes 105 Interim v1, NCBI. However, this track is available locally by default, so you should not need to download it.

2. Collapsing Rare Variants

The Phenotypes + 1kG Exomes – Sheet 1 spreadsheet consists of exomes taken from the 1000 Genomes Phase 1 project, joined with a simulated phenotype. For expediency in this tutorial, we are using only data from Chromosomes 5 and 6.

Two approaches for collapsing the rare variants are discussed below. The first detects the presence of at least one variant in a gene or region, while the second counts the number of variants in a gene or region.

Detect the Presence of a Variant in a Gene

We will use the Count Variants per Gene tool to detect the presence or absence of variants for each subject in gene regions of the spreadsheet. The resulting spreadsheet contains binary columns (presence/absence) for each gene. A Fisher’s exact test can then be performed with the simulated phenotype.

  • From Phenotypes + 1kG Exomes – Sheet 1, choose DNA-Seq > Collapsing Methods > Count Variants per Gene.
  • The RefSeq Genes 105 Interim v1, NCBI track and the Reference field should be selected by default.
  • Choose Both Ref_Alt and Alt_Alt variants after Choose the types of variants to count:.
  • Leave Binary Presence/Absence of Variants (Per Gene).
  • The dialog should look like Figure 2-1. Click OK.
Count Variants per Gene dialog
Figure 2-1: Count Variants per Gene dialog

A new spreadsheet called Variant present in gene – Both Alt_Alt and Ref_Alt variants is created. The marker map entry for each column name contains the start and end map positions for the gene region represented by the column. The marker map also includes the gene name and other data for the gene represented by that column.

Join the phenotype information with this spreadsheet so that a Fisher’s Exact Test can be performed.

NOTE: This test is performed in [Cohen2004].

  • Open the Phenotypes + 1kG Exomes – Sheet 1 spreadsheet and choose Select > Column > Inactivate All Columns. This will create spreadsheet Phenotypes + 1kG Exomes – Sheet 2.
  • Reactivate columns 1 and 2 (Case? and Population) by left-clicking once on each column’s header.
  • Choose File > Join or Merge Spreadsheets.
  • Choose Variant present in gene – Both Alt_Alt and Ref_Alt variants from the list and click OK.
  • Change New dataset name: to C/C + Variant present in gene – Both Alt_Alt and Ref_Alt variants.
  • The dialog should look like Figure 2-2. Click OK.
Join or Merge Spreadsheets dialog
Figure 2-2: Join or Merge Spreadsheets dialog

Next, run Fisher’s Exact Test from the merged spreadsheet.

  • Click once on the Case? column header to change the color of the column to magenta to signify its dependent status.
  • Choose Numeric > Fisher’s Exact Test for Binary Predictors.

The resulting spreadsheet contains the statistical test results.

  • From Fisher’s Exact Test Results, right-click the -log10 P-value column header and choose Plot Variable in GenomeBrowse.
  • In the chromosome chooser dropdown near the upper-left corner of the plot window, enter 5 – 6. Then click the right-arrow icon  just to the right of the genomic coordinate search bar near the top of the plot window.
  • In the Plot Tree, right-click on the second -log10 P-Value graph item and select Edit Title…. Enter in the new name: Fisher’s Exact -log10 P on Presence of Var.

The plot should look like Figure 2-3.

_images/fishers-p-value-plot.png
Figure 2-3: P-value plot from Fisher’s Exact Test results

Count Number of Variants (Per Gene)

Repeat this analytic method, instead choosing Count of Number of Variants (Per Gene) in the Count Variants per Gene dialog. The resulting spreadsheet will contain integer columns (rather than binary) with the count of variants found in each gene. Thus a Numeric Association test can be performed (rather than a Fisher’s Exact Test). This test will assess the burden of multiple damaging variants in each individual, rather than the simple presence of one or more damaging variants. This test is similar to the test (Cohort Allelic Sums Test or CAST) discussed in [Morgenthaler2007].

  • From Phenotypes + 1kG Exomes – Sheet 1, choose DNA-Seq > Collapsing Methods > Count Variants per Gene.
  • After Choose the types of variants to count: select Both Ref_Alt and Alt_Alt variants.
  • After Select the type of output to generate: choose Count of Number of Variants (Per Gene).
  • The dialog should look like Figure 2-4. Click OK.
Count Variants per Gene dialog
Figure 2-4: Count Variants per Gene dialog

Join the phenotype information with this spreadsheet.

  • Open Phenotypes + 1kG Exomes – Sheet 2 and choose File > Join or Merge Spreadsheets.
  • Choose the Variant counts per gene – Both Alt_Alt and Ref_Alt variants from the list and click OK.
  • Change New dataset name: to Phenotype + Variant counts per gene – Both Alt_Alt and Ref_Alt variants.
  • Click OK.

Run a numeric association test from the merged spreadsheet.

  • Left-click once on the Case? column header to change the color of the column to magenta to signify its dependent status.
  • Choose Numeric > Numeric Association Tests.
  • Check Correlation/Trend test, leave the other options at their default values, and click Run.

Add the results to the previous plot.

  • Open Plot of Column -log10 P-Value from Fisher’s Exact Test Results
  • Click on the -log10 P-value node in the Plot Tree.
  • Choose the Add tab in the Controls dialog box and select the Add Plot Item(s) button. This will open the Data Source Library.
  • Click the Project button in the upper left corner to select data from an existing spreadsheet.
  • Choose Association Tests and check Corr/Trend -log10 P from the list. See Figure 2-5.
_images/addDataSources.png
Figure 2-5: Add Data Sources Window
  • Click Plot & Close.

Next, rename the graph item.

  • Right-click on the graph item Corr/Trend -log10 P and select Edit Title….
  • Enter in the new name: Corr/Trend -log10 P on Number of Variants.
  • Then on the Style tab of the Controls dialog, click the blue rectangle to change the color to green to differentiate the value points from those of the first plot.

The plot should now contain results from two analyses and look like Figure 2-6.

_images/fishers-vs-count-plot.png
Figure 2-6: P-value comparison plot

Notice that, in general, a higher significance is obtained from testing for association on variant counts rather than from testing for association on the mere presence of variants.

3. Variant Frequency Binning and CMC Method

The CMC method [Li2008] bins variants according to parameters such as minor allele frequency, then collapses variants from each bin within defined regions such as genes. In your own study, you may want to use a different method to define your own variant bins. This tutorial demonstrates the process of creating bins based on minor allele frequency as determined by the 1000 Genomes Project Phase 1 data.

Create Frequency Bins

  • Open Phenotypes + 1kG Exomes – Sheet 1 and choose DNA-Seq > Variant Binning by Frequency Source.
  • Choose Select Track and highlight 1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI. Click Select and Next>.
  • Leave the single threshold value of 0.01.
  • The dialog should look like Figure 3-1. Click OK.
Variant Binning by Frequency Track dialog
Figure 3-1: Variant Binning by Frequency Track dialog

The resulting spreadsheet contains at least four columns:

  • An integer Allele Frequencies Bin column indicating into which bin the marker fell.
  • The Allele Frequencies value from the selected annotation track (in this case showing us the alternate allele frequency of the SNP in the One-thousand Genomes Project (1kG)).
  • A list of alleles present.
  • The observed reference/alternate alleles from the annotation track.
  • Additionally, other fields from the track may be included. For this track, 32 other information fields for each marker are provided.

If the marker from the spreadsheet was not present in the selected probe track, the Allele Frequencies value is listed as missing and the marker is assigned to the 0 bin, as the variant must be so rare that it wasn’t even identified in the 1kG reference dataset used to build the probe track.

CMC Method

The CMC method uses the variant bins to perform a multivariate test collapsed over specified regions. For this test, those regions will be gene regions. This method uses the combined effect of multiple variants in a gene to determine the association with the phenotype. For this test, we will use the CMC Hotelling T2 algorithm.

  • In the Phenotypes + 1kG Exomes – Sheet 1 spreadsheet, left-click once on the Case? column name header to change the color of the column to magenta to signify its dependent status. Spreadsheet Phenotypes + 1kG Exomes – Sheet 3 will be created.
  • Choose DNA-Seq > Collapsing Methods > CMC with Hotelling T Squared Test.
  • The RefSeq Genes 105 Interim v1, NCBI track should be selected by default.
  • Under Variant Bins for CMC click Select Sheet.
  • Select the Variant Frequency Bins from 1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI spreadsheet created above and click OK.
  • Click Select Column for Variant Bin column….
  • Choose the Allele Frequencies Bin column and click OK.
  • The dialog should look like Figure 3-2. Click OK to run CMC with the Hotelling algorithm.
CMC with Hotelling T Squared Test dialog
Figure 3-2: CMC with Hotelling T Squared Test dialog

The resulting spreadsheet, CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI, uses row numbers (which correspond to indexes into the marker map) for row labels and, for each gene region, reports the chromosome, start and end positions, gene name, transcript name(s), strand, the CMC p-value, the -log10 P-Value and various statistics and multiple-testing-corrected results.

Now, plot the -log10 p-values to investigate possible associations between rare variants and the phenotype.

  • Right click on the -log10 P-Value column header (Column 8) and choose Plot Variable in GenomeBrowse.
  • In the chromosome chooser dropdown near the upper-left corner of the plot window, enter 5 – 6. Then click the right-arrow icon  just to the right of the genomic coordinate search bar near the top of the plot window.

To prevent confusion later, rename the graph item to reflect the CMC test performed.

  • In the Plot Tree, right-click on the second -log10 P-Value graph item and select Edit Title…. Rename this graph item to CMC Hotelling T^2 -log10 P. See Figure 3-3.
CMC -log10 P-values from the Hotelling's T Squared algorithm
Figure 3-3: CMC -log10 P-values from the Hotelling’s T Squared algorithm

CMC Method with Regression

The CMC method also offers the regression algorithm. The CMC regression algorithm not only offers correction for covariates (integer, real or binary), but also allows having a quantitative dependent variable.

Next, use CMC regression simply to cross-check the results. If a gene is significant under one CMC method, one would expect it to be significant using the other.

  • From Phenotypes + 1kG Exomes – Sheet 3 choose DNA-Seq > Collapsing Methods > CMC with Regression.
  • The RefSeq Genes 105 Interim v1, NCBI track should be selected by default.
  • Under Variant Bins for CMC click Select Sheet.
  • Select the Variant Frequency Bins from 1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI spreadsheet created earlier and click OK.
  • Click Select Column for Variant Bin column….
  • Choose the Allele Frequencies Bin column and click OK.
  • To save time, after Significance Testing, uncheck Also compute permuted p-values.
  • The dialog should look like Figure 3-4. Click OK to run CMC with regression.
CMC with Regression dialog
Figure 3-4: CMC with Regression dialog

The resulting spreadsheet, CMC Regression with RefSeq Genes 105 Interim v1, NCBI, also uses row numbers (indexes into the marker map) for row labels and also, for each gene region, reports the chromosome, start and end positions, gene name, transcript name(s), strand, CMC p-value, -log10 P-Value and various statistics and multiple-testing-corrected results. (For the test we just performed, these statistics include the beta and standard error for regressing on each bin, because we kept the default selection of Output bin betas and their standard errors in the dialog.)

Add these results to the previously created plot.

  • From Plot of Column -log10 P-Value from CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI, click on -log10 P-Value in the Plot Tree. Choose the Add tab and select Add Plot Item(s) to open the Add Data Sources dialog.
  • Click the Project button in the upper left corner to add data from an existing spreadsheet.
  • Choose CMC Regression with RefSeq Genes 105 Interim v1, NCBI.
  • Check -log10 P-Value from the list and click Plot & Close.
  • Rename this graph item (the second -log10 P-Value) by right-clicking on it and selecting Edit Title…. Rename it to CMC w Regression -log10 P and change the color of the data points to green under the Style tab of the Controls dialog. The plot should look like Figure 3-5.
_images/cmc-comparison-p-values.png
Figure 3-5: CMC P-values from Hotelling in blue and from regression in green

Notice that (according to this simulated data) the regression p-values (green) do not seem to be as significant as the T-test p-values (blue), even while both algorithms do agree on which genes are more significant than other genes.

CMC Method with Regression Using Transcripts

Normally, CMC (and also KBAC) will consolidate all transcripts that have the same gene name into one region, starting with the minimum start position of any transcript and ending with the maximum stop position of any transcript.

However, CMC and KBAC can also perform an individual test for every transcript. Additionally, CMC and KBAC allow you to enlarge the region on both ends by a specified number of base pairs to include further variants.

Performing an individual test for every transcript will allow you to catalog associations by transcript and may result in different levels of significance between transcripts that cover the same gene. However, this option is not normally recommended because it will result in a more severe multiple testing penalty than will consolidating transcripts into genes.

You may want to enlarge the gene or transcript region if you suspect the further variants that are included may affect gene regulation or splicing. However, enlarging the region may include variants from neighboring regions.

For comparison, run CMC using transcripts to define regions and also expand the region size. We will compare the results to those already obtained using genes as regions (without expansion).

  • From Phenotypes + 1kG Exomes – Sheet 3, choose DNA-Seq > Collapsing Methods > CMC with Regression.
  • The RefSeq Genes 105 Interim v1, NCBI track should be selected by default.
  • To the right of Perform one test per…, select transcript.
  • Check Include nearby markers (distance in bp): and leave the default of 1000.
  • Under Variant Bins for CMC click Select Sheet.
  • Select the Variant Frequency Bins from 1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI spreadsheet created earlier and click OK.
  • Click Select Column for Variant Bin column….
  • Choose the Allele Frequencies Bin column and click OK.
  • To save time, after Significance Testing, uncheck Also compute permuted p-values.
  • The dialog should look like Figure 3-6. Click OK to run CMC with regression.
CMC with Regression on transcripts dialog
Figure 3-6: CMC with Regression on transcripts dialog

The resulting spreadsheet, also called CMC Regression with RefSeq Genes 105 Interim v1, NCBI, uses row numbers (indexes into the marker map) for row labels, and, for each transcript, reports the chromosome, the actual start and end positions used, the gene name, the transcript name, the strand, the p-value, the -log10 P-Value and the other remaining statistics for the transcript, including the beta and standard error for regressing on each bin.

Now add these results to the previously created plot.

  • In the plot that has -log10 P-values for the other two CMC tests (Plot of Column -log10 P-Value from CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI), click on -log10 P-Value in the Plot Tree.
  • Choose the Add tab and select Add Plot Item(s) to open the Add Data Sources dialog.
  • Click the Project button in the upper left corner to add data from an existing spreadsheet.
  • Choose the second CMC Regression with RefSeq Genes 105 Interim v1, NCBI .
  • Check -log10 P-Value from the list and click Plot & Close.
  • Rename this graph item (now the second -log10 P-Value) by right-clicking on it and selecting Edit Title…. Rename it CMC Reg on Transcripts -log10 P and change the color of the data points to orange under the Style tab of the Controls dialog.

Notice that the regression results using transcripts (orange) are very similar to the regression results using genes (green). To see this better, put the transcript results “on the bottom” in the graph.

  • Press and hold the mouse button on the CMC Reg on Transcripts -log10 P node in the Graph Control Interface, then drag the node to just under the CMC Hotelling T^2 -log10 P node and drop it, so that the CMC Reg on Transcripts -log10 P node becomes the last node under -log10 P-Value.

The plot should look like Figure 3-7.

CMC P-values: Hotelling, blue; regression, green; regression with transcripts, orange
Figure 3-7: CMC P-values: Hotelling, blue; regression, green; regression with transcripts, orange
  • To see the similarities between CMC Reg on Transcripts -log10 P and CMC w Regression -log10 P even more clearly, uncheck CMC Hotelling T^2 -log10 P and then alternately check and uncheck CMC Reg on Transcripts -log10 P.

4. KBAC Method

The KBAC method [Liu2010] collapses the variant data within a region by categorizing it into multi-marker genotypes over specified regions. No binning procedure is necessary to use the KBAC method.

The KBAC method uses the counts of these multi-marker genotypes to perform a special multivariate case/control test to determine their association with the (case/control) phenotype. This test gives multi-marker genotypes with higher sample risks higher weights so as to potentially separate causal from non-causal multi-marker genotypes.

This weighting procedure results in the KBAC method being much more suitable as a one-sided test than as a two-sided test.

KBAC with Permutation Testing

  • From Phenotypes + 1kG Exomes – Sheet 3 choose DNA-Seq > Collapsing Methods > KBAC with Permutation Testing.
  • To save time, check Adaptive permutation testing (threshold alpha): and leave the default .01 value.
  • Check Two-sided statistics under Outputs, so that One-sided statistics (recommended) and Two-sided statistics are both checked.
  • Under Regions, the RefSeq Genes 105 Interim v1, NCBI track should already be selected by default.
  • The dialog should look like Figure 4-1. Click OK to run KBAC with permutation testing.
KBAC with Permutation Testing dialog
Figure 4-1: KBAC with Permutation Testing dialog

As you may have guessed by looking at the secondary (lower) progress bar, there is a substantial amount of time saved by using the adaptive permutation procedure. With this procedure, a gene that has a more highly significant p-value is tested with the maximum number of permutations, while other genes that are in a less significant p-value range are only tested with enough permutations to give a reasonable enough estimate of the p-value to be able to infer its less-than-high significance.

Only a few of the gene regions in this example require anywhere near all of the 1000 (maximum) permutations that were requested in the dialog, and a majority of regions require fewer than 100 or even 50 permutations.

NOTE: If you have a dataset with more than 400 cases and 400 controls, then either using the Marginal binomial kernel type or the Asymptotic normal kernel type, or using the KBAC Monte-Carlo approximation as the permutation mode, can save some computation time–up to 50% when Asymptotic normal and KBAC Monte-Carlo approximation are combined. However, the data we are using only has 47 cases–not enough to warrant using these approximations.

The resulting spreadsheet, KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI, uses row numbers (indexes into the marker map) for row labels, and, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the p-value, the -log10 P-Value and various statistics and multiple-testing-corrected results for both the one-sided KBAC and two-sided KBAC tests.

Plot the -log10 p-values to investigate possible associations between rare variants and the phenotype.

  • In this spreadsheet (KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI), right click on the -log10 P-Value (One-Sided) column header (Column 8) and choose Plot Variable in GenomeBrowse.
  • In the chromosome chooser dropdown near the upper-left corner of the plot window of the resulting plot, enter 5 – 6. Then click the right-arrow icon  just to the right of the genomic coordinate search bar near the top of the plot window.
  • In the Plot Tree, click on the first -log10 P-Value (One-Sided) node. Choose the Add tab under Controls and select Add Plot Item(s) to open the Add Data Sources dialog.
  • Click the Project button and select the KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI spreadsheet.
  • In the list of plot items, go to -log10 P-Value (Two-Sided) and check it. Click Plot & Close.
  • Select the -log10 P-Value (Two-Sided) plot item and go to the Style tab of the Controls dialog. Change the color to green by clicking on the blue box and selecting green from the color options.

The plot should look something like Figure 4-2.

-log10 P-values from KBAC with Permutation Testing--one-sided, blue;  two-sided, green
Figure 4-2: -log10 P-values from KBAC with Permutation Testing–one-sided, blue; two-sided, green

NOTE: Your results may differ slightly since these results are dependent upon permutation testing.

Note that for the more significant values in this test, one-sided and two-sided tests come up with the same answers. To see this better, drag the -log10 P-Value (Two-Sided) node in the Plot Tree and drop it just below the -log10 P-Value (One-Sided) node. To see this better yet, alternately check and uncheck the -log10 P-Value (Two-Sided) node (now that you have this node on the bottom).

These same answers occur because the one-sided KBAC test is weighted toward results where the multi-marker genotypes have higher sample risks and away from results where the multi-marker genotypes have lower sample risk. This pushes the distribution almost completely into the area of positive test values, so that the two-sided KBAC test, which is simply the square of the one-sided KBAC test, will come up with the same p-values for all tests except those having the least significance.

This is why checking One-sided statistics (recommended) is normally recommended. The intrinsic one-sidedness of the KBAC test is discussed further after the next section.

KBAC with Regression

KBAC also offers a regression-based algorithm to allow correction for covariates.

Unlike the situation with the CMC method and CMC with Hotelling vs. CMC with regression, the basic statistic underlying KBAC with regression is completely equivalent to the basic statistic underlying KBAC with permutation testing when there is no correction for covariates. The only p-value differences that will arise when there are no covariates come from the fact of using permutation testing to obtain p-values.

  • From Phenotypes + 1kG Exomes – Sheet 3 choose DNA-Seq > Collapsing Methods > KBAC with Regression.
  • Check Adaptive permutation testing (threshold alpha): and leave the default .01 value.
  • Check Two-sided statistics under Outputs, so that One-sided statistics (recommended) and Two-sided statistics are both checked.
  • Under Regions, the RefSeq Genes 105 Interim v1, NCBI track should already be selected by default.
  • The dialog should look like Figure 4-3. Click OK to run KBAC with regression.
KBAC with Regression dialog
Figure 4-3: KBAC with Regression dialog

This may take somewhat more time because the regression mechanism that is in place for correcting for covariates is being used, even though we are not actually correcting for covariates.

The resulting spreadsheet, KBAC Regression with RefSeq Genes 105 Interim v1, NCBI, also uses row numbers (indexes into the marker map) for row labels and also, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the p-value, the -log10 P-Value and various statistics and multiple-testing-corrected results for both the one-sided KBAC and two-sided KBAC tests.

Now add these results to the KBAC with Permutation Testing plot for comparison.

  • Go to this plot (called Plot of Column -log10 P-Value (One-Sided) from KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI).
  • Right-click on the second -log10 P-Value (One-Sided) plot node and click Edit Title…. Rename this node to -log10 P-Value (OS Perm Testing).
  • To declutter the plot, uncheck the -log10 P-Value (Two-Sided) node.
  • Click on the (remaining) -log10 P-Value (One Sided) node in the Plot Tree. Choose the Add tab and select the Add Plot Item(s) button.
  • Click the Project button and select the KBAC Regression with RefSeq Genes 105 Interim v1, NCBI spreadsheet. Check -log10 P-Value (One-Sided) on the list. Click Plot & Close.
  • Right-click on the new (second) -log10 P-Value (One-Sided) item and rename it to -log10 P-Value (OS Regression). Then change the color to orange on the Style tab of the Controls dialog.

The plot should look something like Figure 4-4.

KBAC with Permutation Testing in blue, KBAC with Regression in orange
Figure 4-4: KBAC with Permutation Testing in blue, KBAC with Regression in orange

Notice that the results are basically the same, with minor fluctuations due to permutation testing. In a few cases, the results are exactly the same.

NOTE: If you were to add the -log10 P-Value (Two-Sided) from the KBAC with Regression spreadsheet, you would find its results agreeing with the regression one-sided results for the more significant p-values, just as they did between one-sided and two-sided KBAC with Permutation Testing.

Now, cross-check the KBAC results with the CMC results–specifically, with the CMC with Hotelling’s T2 test.

  • Click on the (remaining) -log10 P-Value (One-Sided) node in the Plot Tree. Choose the Add tab and select Add Plot Item(s)
  • Click the Project button and select the CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI spreadsheet, then check -log10 P-Value from the list and click Plot & Close.
  • Right-click on the new -log10 P-Value item, rename it to -log10 P-Value (CMC Hotelling), and use the Style tab to change this plot item’s color to black.

The plot should look similar to Figure 4-5.

KBAC (one-sided) with Permutation Testing, blue; KBAC (one-sided) with Regression, orange; CMC with Hotelling's, black
Figure 4-5: KBAC (one-sided) with Permutation Testing, blue; KBAC (one-sided) with Regression, orange; CMC with Hotelling’s, black

Note that there are two positions where the CMC with Hotelling’s method has the most significant results, one in Chromosome 5 and one in Chromosome 6. Let’s zoom in and examine the significant result in Chromosome 5.

  • Zoom in on this gene by selecting the magnifying glass icon on the  zoom toolbar, then pressing your left mouse button to the left of this gene and while holding down your left mouse button dragging the cursor across to the right of this gene, finally releasing the mouse button to complete the zoom. Repeat this process until the dot representing this gene has become a horizontal oval taking up almost the whole left half of the plot area. (Left half, since we will also want to view the next two results that will be to the right.)

NOTE: If you need to re-do any zoom, use the fat left arrow icon on the zoom toolbar to go back to any previous zoom level.

When finished, the plot window should look something like Figure 4-6.

Zoom showing the KIF3A and SEPT8 genes
Figure 4-6: Zoom showing the KIF3A and SEPT8 genes

As you see, each p-value in the plot corresponds to an entire gene–the oval representing any p-value and gene will stretch out over the whole genomic extent of that gene.

We will be examining the results for these genes, which are KIF3A and SEPT8.

First, note that for KIF3A, the -log10 P for one-sided KBAC Permutation Testing and for one-sided KBAC with Regression are both about 3.0, which corresponds with a p-value of .001 (1 divided by 1000).

  • You may need to uncheck -log10 P-Value (OS Perm Testing) to see the result for -log10 P-Value (OS Regression) more clearly (or at all).
  • Check -log10 P-Value (Two-Sided) to reveal another KBAC result for gene KIF3A for which -log10 P is approximately 3.0.

Of course, p = .001 is the best significance result that permutation testing with 1000 iterations can possibly achieve, and the fact that this is achieved for various KBAC tests on gene KIF3A is in agreement with the very significant p-value assigned to this gene by the CMC with Hotelling test.

Now, let us examine the results for gene SEPT8. We see that both the one-sided KBAC permutation testing result and the one-sided KBAC regression result show very little significance, even while the CMC result and the two-sided KBAC result do show some significance. Question: Why are the one-sided results showing so much less significance for this gene?

Remember that the KBAC method is intrinsically a one-sided test, meaning that it is meant to discover if the gene variants tested confer higher sample risk. Therefore, a -log10 P-Value close to zero for the one-sided test we have been doing may mean that there could be an association in the direction of the gene variant conferring lower sample risk rather than of conferring higher sample risk.

We will now check to see if there is a significant association with any gene variants in the sense of conferring lower sample risk rather than higher sample risk.

KBAC for Lower Sample Risk

To find out, you need to invert the case/control status and test again.

  • Open Phenotypes + 1kG Exomes – Sheet 3 and choose Edit > Edit This Spreadsheet.
  • Right-click on the header of the Case? column.
  • Choose Transform > Invert Boolean Values.
  • Click OK to overwrite the column.
  • Choose File > Save.
  • After Dataset Name: enter Inverted Phenotypes + 1kG Exomes.
  • Click OK.

Now, re-perform KBAC with Regression.

  • From the new spreadsheet, left-click once on the not(Case?) column to signify its dependent status.
  • Choose DNA-Seq > Collapsing Methods > KBAC with Regression.
  • This time, in Permutations under Parameters for KBAC (Liu & Leal, 2010) Logistic Regression, change the Number of permutations to use: to 10000
  • Check Adaptive permutation testing (threshold alpha):, and change the (threshold alpha) in the blank on the right to .001.
  • Under Regions, the RefSeq Genes 105 Interim v1, NCBI track should already be selected by default.
  • Leave the rest of the defaults and click OK.

This may take almost twice as long, but it will be sensitive to p-values down to .0001 rather than just down to .001. To compensate, the alpha threshold was adjusted down, so as to spend less time on any but the most significant gene regions.

Now add these results to the previously created plot to compare.

  • Using the Plot of Column -log10 P-Value (One-Sided) from KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI plot, click on the -log10 P-Value (One-Sided) node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.
  • Choose the latter occurrence of KBAC Regression with RefSeq Genes 105 Interim v1, NCBI (to get the result spreadsheet just created) and check -log10 P-Value (One-Sided) from the list. Then click Plot & Close.
  • Right-click on the new (second) -log10 P-Value (One-Sided) item and rename it to -log10 P-Value (KBAC Lower Risk).
  • Change the color of the data points to red on the Style tab of the Controls dialog.

The plot should look similar to Figure 4-7.

KIF3A and SEPT8 genes showing KBAC (two-sided), green; KBAC (one-sided) with Permutation Testing, blue; KBAC (one-sided) with Regression, orange; CMC with Hotelling's, black; KBAC testing for lower risk, red
Figure 4-7: KIF3A and SEPT8 genes showing KBAC (two-sided), green; KBAC (one-sided) with Permutation Testing, blue; KBAC (one-sided) with Regression, orange; CMC with Hotelling’s, black; KBAC testing for lower risk, red

While we have seen that the KIF3A gene is definitely significant in the direction of higher risk, we see that it is found to be not significant at all in the direction of lower risk, which is self-consistent.

Meanwhile, let’s look over at the SEPT8 gene. We see that according to one-sided KBAC permutation testing, it has no significance at all in the direction of higher risk, even while it has some significance according to Hotelling CMC. Let’s declutter this portion of the plot.

  • Uncheck the -log10 P-Value (CMC Hotelling)-log10 P-Value (OS Perm Testing), and -log10 P-Value (Two-Sided) plot nodes.

Now, it’s abundantly clear that, by contrast to gene KIF3A, gene SEPT8 has a certain level of significance in the direction of lower risk and basically no significance in the direction of higher risk, which is also self-consistent (See Figure 4-8). Both of these results are also consistent with a CMC Hotelling test picking up significance for both genes. (See Figures 4-7 and 4-8.)

KBAC test for higher risk, orange; KBAC test for lower risk, red
Figure 4-8: KBAC test for higher risk, orange; KBAC test for lower risk, red
  • Enter 5 – 6 in the chromosome chooser dropdown and click the right-arrow icon to get the bigger picture.

You will then see that while, for our test data, not that many genes are very significant for lower risk, it is still true that those genes that are the most significant for lower risk are not at all the same as those genes that are most significant for higher risk.

  • Click the fat left arrow icon to return to the view of Figure 4-8.

5. SKAT-O and Generalized SKAT

Generalized Sequence Kernel Association Test (Generalized SKAT)[Lee2012] and [Liu2009] is a test which combines the test statistics of the following, according to a user-specified ratio Rho:

  • Burden testing, which collapses the variant data within a region by summing the minor allele counts for each marker in the region, and testing this against the phenotype. By contrast to Count Number of Variants (Per Gene), the counts are usually weighted by a function of each marker’s minor allele frequency (MAF), so as to establish a contrast between rare and common variants.
  • Sequence Kernel Association Test (SKAT), a test which collapses the variant data within a region by summing the squares of score statistics for testing individual markers. Just as for Burden testing, weights based on each marker’s MAF are usually used to establish a contrast between rare and common variants. In this case, if weighting is used, the individual squares of score statistics are weighted before they are summed.

The ratio Rho of 1 corresponds to a pure Burden test, and a ratio Rho of 0 corresponds to purely an (original) SKAT test.

Optimized SKAT (SKAT-O) [LeeWuLin2012] is a procedure which optimizes Generalized SKAT over a grid of N values of Rho between zero and one, inclusive, in such a way as to count as only one test for multiple testing purposes instead of as N tests. (In Golden Helix SVS, seven grid points are used (N = 7), so we are talking about avoiding having to multiply the number of tests by 7 to get a proper multiple testing correction.)

Burden Testing using the SKAT-O Feature

First, we will run a Burden test (Generalized SKAT with Rho = 1).

  • From Phenotypes + 1kG Exomes – Sheet 3 choose DNA-Seq > Collapsing Methods > SKAT-O.
  • Check Generalized SKAT with rho = and select a rho value of 1.
  • Uncheck SKAT-O with standard grid of rho values.
  • Under Regions, the RefSeq Genes 105 Interim v1, NCBI track should already be selected by default.
  • Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically).
  • The dialog should now look like Figure 5-1. Click OK.
SKAT-O dialog for Burden Testing
Figure 5-1: SKAT-O dialog for Burden Testing

  • Rename the resulting spreadsheet to Gen. SKAT test with Rho = 1.

This result spreadsheet uses row numbers (indexes into the marker map) for row labels, and, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the SKAT p-value, the -log10 P-Value for SKAT and various other statistics and multiple-testing-corrected results.

Now, plot the -log10 p-values to investigate what associations there are between rare variants and the phenotype according to Burden testing.

  • In the resulting spreadsheet (renamed above), right click on the -log10 SKAT P-Value column header (Column 8) and choose Plot Variable in GenomeBrowse.
  • In the chromosome chooser dropdown near the upper-left corner of the plot window of the resulting plot, enter 5 – 6. Then click the right-arrow icon  just to the right of the genomic coordinate search bar near the top of the plot window.
  • In the Plot Tree, right-click on the second -log10 SKAT P-Value node and rename it to -log10 P-Value (Burden)
  • In the Plot Tree, click on the (remaining) -log10 SKAT P-Value node. Choose the Add tab under Controls and select Add Plot Item(s) to open the Add Data Sources dialog.
  • Click the Project button and select the CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI spreadsheet.
  • In the list of plot items, go to -log10 P-Value and check it. Click Plot & Close.
  • Right-click on the -log10 P-Value plot item and rename it to -log10 P-Value (CMC Hotelling). Then change the color to orange on the Style tab of the Controls dialog.

The plot should look something like Figure 5-2.

-log10 P-values from Burden Testing, blue; -log10 P-Values from CMC Hotelling, orange
Figure 5-2: -log10 P-values from Burden Testing, blue; -log10 P-Values from CMC Hotelling, orange

We see that Burden testing is finding the same associations as does the CMC Hotelling test. Genes that are significant for Burden testing are also significant for the CMC Hotelling test.

SKAT and Generalized SKAT

Next, we will run an (original) SKAT test (Generalized SKAT with Rho = 0).

  • From Phenotypes + 1kG Exomes – Sheet 3 choose DNA-Seq > Collapsing Methods > SKAT-O.
  • Check Generalized SKAT with rho = and select a rho value of 0.
  • Uncheck SKAT-O with standard grid of rho values.
  • Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically). Then click OK.
  • Rename the resulting spreadsheet to Gen. SKAT test with Rho = 0.

Now, we will run a Generalized SKAT test with a Rho value somewhere in the middle between zero and one. Based on the (somewhat logarithmic) grid of values that Golden Helix SVS SKAT-O uses, the values of which are 0, 0.01, 0.04, 0.09, 0.25, 0.5, and 1, we will choose Rho = .25.

  • From Phenotypes + 1kG Exomes – Sheet 3 choose DNA-Seq > Collapsing Methods > SKAT-O.
  • Check Generalized SKAT with rho = and select a rho value of .25.
  • Uncheck SKAT-O with standard grid of rho values.
  • Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically). Then click OK.
  • Rename the resulting spreadsheet to Gen. SKAT test with Rho = .25.

Both of these result spreadsheets, now called Gen. SKAT test with Rho = 0 and Gen. SKAT test with Rho = .25, have the same row labels, column headers, and reported fields as the Burden test result spreadsheet, now called Gen. SKAT test with Rho = 1.

Let’s compare these results with Burden tests and CMC Hotelling tests. First, add the Generalized SKAT results for Rho = .25.

  • Using the Plot of Column -log10 SKAT P-Value from Gen. SKAT test with Rho = 1 plot, click on the -log10 SKAT P-Value node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.
  • Choose Gen. SKAT test with Rho = .25 and check -log10 SKAT P-Value from the list. Then click Plot & Close.
  • Right-click on the new -log10 SKAT P-Value item and rename it to -log10 P-Value (Rho = .25).
  • Change the color of the data points to turquoise on the Style tab of the Controls dialog.

Now, repeat this procedure for the (original) SKAT results (Rho = 0).

  • Click again on the -log10 SKAT P-Value node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.
  • Choose Gen. SKAT test with Rho = 0 and check -log10 SKAT P-Value from the list. Then click Plot & Close.
  • Right-click on the new -log10 SKAT P-Value item and rename it to -log10 P-Value (Original SKAT).
  • Change the color of the data points to green on the Style tab of the Controls dialog.

The plot should look similar to Figure 5-3.

-log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from original SKAT, green; -log10 P-Values from CMC Hotelling, orange
Figure 5-3: -log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from original SKAT, green; -log10 P-Values from CMC Hotelling, orange

We see that, in general, a significant result for any of the tests in this plot implies a significant result for all the other tests in this plot. This is as would be expected.

Let us look specifically at the three Generalized SKAT tests that use different Rho values.

Because the amount of significance should partly depend on whether the direction of effect is different for different markers within the same gene region, we expect to see different tests being the most significant for different genes.

  • Declutter the plot by unchecking -log10 P-Value (CMC Hotelling) in the Plot Tree. The plot should look similar to Figure 5-4.
-log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from original SKAT, green
Figure 5-4: -log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from original SKAT, green

Here, we can see this in action. The Burden test (dark blue) has the best result for some genes, the original SKAT test (green) has the best result for other genes, and, for a few genes, the best result is from the Generalized SKAT test using Rho = .25 (turquoise).

The SKAT-O Test

Finally, we will run the SKAT-O test. To save time, we will only run this for Chromosome 6.

  • Open the Phenotypes + 1kG Exomes – Sheet 3 spreadsheet and choose Select > Activate by Chromosomes. Uncheck 5 and leave 6 checked. Click OK. This will create spreadsheet Phenotypes + 1kG Exomes – Sheet 4.
  • Choose DNA-Seq > Collapsing Methods > SKAT-O.
  • (Leave Generalized SKAT with rho = unchecked and leave SKAT-O with standard grid of rho values checked.)
  • Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically).
  • The dialog should now look like Figure 5-5. Click OK.
Dialog for the SKAT-O Test Itself
Figure 5-5: Dialog for the SKAT-O Test Itself
  • (Leave the name of the resulting spreadsheet as is.)

The resulting spreadsheet, SKAT-O Testing with RefSeq Genes 105 Interim v1, NCBI (Somewhat-small-sample corrected), uses row numbers (indexes into the marker map) for row labels, and, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the SKAT-O p-value, the -log10 SKAT-O P-Value and other various statistics and multiple-testing-corrected results.

Now, let us compare SKAT-O with the other results.

  • Using the Plot of Column -log10 SKAT P-Value from Gen. SKAT test with Rho = 1 plot, click on the -log10 SKAT P-Value node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.
  • Choose SKAT-O Testing with RefSeq Genes 105 Interim v1, NCBI (Somewhat-small-sample corrected) and check -log10 SKAT-O P-Value from the list. Then click Plot & Close.
  • Change the color of the data points to red on the Style tab of the Controls dialog.

The SKAT-O results (which are only for Chromosome 6) will show in the right half of the plot.

  • Go to the chromosome chooser dropdown near the upper-left corner of the plot window and click its down-arrow. Choose 6. The plot should look similar to Figure 5-6.
-log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from Original SKAT, green; -log10 P-Values from SKAT-O, red
Figure 5-6: -log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from Original SKAT, green; -log10 P-Values from SKAT-O, red

We see that the SKAT-O (red) p-values don’t always reach the significance of the best Generalized SKAT p-values, but they do almost always track the best Generalized SKAT p-value, whichever one that is, with each SKAT-O result having a -log10 P that is about 0.6 less than that of the best Generalized SKAT -log10 P for that gene.

Being consistent with the best Generalized SKAT p-value is what was intended for the SKAT-O test.

6. Compare Results with GWAS Approach

To demonstrate the utility of the collapsing tests, we can compare the results to a GWAS test that assesses association with each individual SNP.

Attempt an Association Test on Individual Rare Variants

  • Go back to the Phenotypes + 1kG Exomes – Sheet 3 spreadsheet. Choose Genotype > Genotype Association Tests.
  • Choose an Additive model, and choose Correlation/Trend test as the Test Statistic or Method.
  • Select Output -log10(P) under Additional Outputs.
  • The dialog should look like Figure 6-1. Click Run.
Genotype Association Tests dialog
Figure 6-1: Genotype Association Tests dialog

An association results spreadsheet is created. This spreadsheet contains marker-mapped results which include the marker name as a row label, Corr/Trend PCorr/Trend -log10 PCorr/Trend R, and other statistics and multiple-testing-corrected results.

  • From the resulting Association Tests (Additive Model) spreadsheet, right click on the Corr/Trend -log10 P column header and choose Plot Variable in GenomeBrowse.
  • In the chromosome chooser dropdown near the upper-left corner of the plot window, enter 5 – 6. Then click the right-arrow icon just to the right of the genomic coordinate search bar near the top of the plot window. The result should look like Figure 6-2.
-log10 P-values from GWAS for Chromosomes 5 and 6
Figure 6-2: -log10 P-values from GWAS for Chromosomes 5 and 6

Notice how many of the results collect into just a few discrete values. Let us take a look at one of these discrete values.

  • On the spreadsheet Association Tests (Additive Model), right-click on the Corr/Trend -log10 P column header and choose Sort Descending.

Note that 891 SNP’s all have exactly the same -log10 P-Value of 3.698670, the first one listed being 5:143219-SNV.

  • On the spreadsheet Phenotypes + 1kG Exomes – Sheet 3, go to the column header of 5:143219-SNV (which is just at Column 13), then right-click and choose Sort Descending.

You will notice there is only one variant (A_G instead of A_A) out of all the samples. Since this corresponds to a case, this relatively significant p-value number results. Similarly, the other discrete values shared by many individual results are also based on only one or a very few variants.

Of course, the phrase “only one or a very few variants” fits the definition of “rare” variants.

Compare with Rare Variant Methods

So, let’s compare the association test results with those from several methods that were actually designed for rare variant analysis.

  • Go to the Plot of Column Corr/Trend -log10 P from Association Tests (Additive Model) plot and click on the first Corr/Trend -log10 P in the Plot Tree. Choose the Add tab and select Add Plot Item(s). Click the Project button on the left side of the Add Data Sources window.
  • Choose CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI spreadsheet and check -log10 P-Value from the list. Click Plot & Close.
  • Rename the resulting plot node to -log10 P-Value (CMC Hotelling) and use the Style tab to change the color to orange.

Follow the same process for adding one-sided KBAC Regression results to the plot.

  • Click on the first Corr/Trend -log10 P in the Plot Tree. Choose the Add tab and select Add Plot Item(s). Click the Project button on the left side of the Add Data Sources window.
  • Choose the first occurrence of KBAC Regression with RefSeq Genes 105 Interim v1, NCBI (to get the result from the KBAC testing for higher risk), then check -log10 P-Value (One-Sided) and click Plot & Close.
  • Rename this plot node to -log10 P-Value (OS KBAC Higher Risk) and use the Style tab to change the color to gray.

And follow a similar process to add the SKAT-O results to the Chromosome 6 section of the plot.

  • Click on the first Corr/Trend -log10 P in the Plot Tree. Choose the Add tab and select Add Plot Item(s). Click the Project button on the left side of the Add Data Sources window.
  • Choose SKAT-O Testing with RefSeq Genes 105 Interim v1, NCBI (Somewhat-small-sample corrected), then check -log10 SKAT-O P-Value and click Plot & Close.
  • Click on this new plot node, then use the Style tab to change the color to red.

The plot should now look approximately like Figure 6-3.

-log10 P-values from GWAS in blue, from CMC Hotelling in orange, from KBAC testing for higher risk in gray, and from SKAT-O testing (in Chromosome 6) in red
Figure 6-3: -log10 P-values from GWAS in blue, from CMC Hotelling in orange, from KBAC testing for higher risk in gray, and from SKAT-O testing (in Chromosome 6) in red

You will note that there are only a few places where a rare variant method yields more significant results than the repetitive association test result of -log10 P = 3.698670 (that we mentioned above), and even not that many places where a rare variant method yields results more significant than the other repetitive association test result of -log10 P = (about) 1.8.

Let us look at gene SLC2A12 as an example of individual vs. collective significance.

  • Enter SLC2A12 into the Genomic Location Bar near the top of the GenomeBrowse window. A drop-down will appear (see Figure 6-4). In this drop-down, click on the first SLC2A12 entry. The result should be about like Figure 6-5.
Drop-down for selecting gene SLC2A12
Figure 6-4: Drop-down for selecting gene SLC2A12
Close-up (for gene SLC2A12) of results from GWAS in blue, from CMC Hotelling in orange, from KBAC testing for higher risk in gray, and from SKAT-O testing in red
Figure 6-5: Close-up (for gene SLC2A12) of results from GWAS in blue, from CMC Hotelling in orange, from KBAC testing for higher risk in gray, and from SKAT-O testing in red

We see that, based on the GWAS test, one of the SNPs in this region has a very significant association, two other SNPs in this same region have more or less significant associations, while all the others in this same region have either not much or very little significance. But we see that when taken together and tested using any one of these three collapsing methods, this group of variants shows high significance.

(While the KBAC method may appear to be showing quite a bit less significance than the other methods, it is, as we have mentioned before, showing the maximum significance that it possibly could using 1000 permutations.)

One factor that helps the rare variant methods find more significance is the fact that in any given region, some invidivual SNPs are monomorphic for the wild type, and thus will not show up in the GWAS results, even while these same monomorphic SNPs do factor into the rare variant analysis algorithms.

If you account for the Bonferroni correction, which is based on a smaller number for the collapsing methods than for the GWAS association test, the difference between the individual and the collective effects will be more pronounced.

Side Note: What About the Direction of Effect?

Since we have said that some tests are sensitive to the direction of effect, let’s take a look.

Direction of Effect for KBAC

First, the two one-sided KBAC tests.

  • Go to the Plot of Column -log10 P-Value (One-Sided) from KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI, and click the Plot button in the upper-left corner of the plot window.
  • Click the Project button on the left side of the Add Data Sources window.
  • Choose Association Tests (Additive Model) and check Corr/Trend R. (Not Corr/Trend P, but instead Corr/Trend R.)
  • Click Plot & Close.

This will create, above the existing plot, a whole new plot consisting of the Corr/Trend R value from the individual-variant association test. This R value indicates the direction of effect that each individual SNP is manifesting with our phenotype. See Figure 6-6. (Note: You may wish to stretch the lower boundary of the plot window to see the gene names.)

Close-up for Genes KIF3A and SEPT8, with Upper plot: Corr/Trend R for Individual SNP's; Lower Plot: KBAC test for higher risk, orange; KBAC test for lower risk, red
Figure 6-6: Close-up for Genes KIF3A and SEPT8, with Upper plot: Corr/Trend R for Individual SNP’s; Lower Plot: KBAC test for higher risk, orange; KBAC test for lower risk, red

We see that for the gene on the left (which is KIF3A), while a number of markers have a subtle effect in the direction of lower risk, the overwhelming effect direction is from the four markers that have a more pronounced effect in the direction of higher risk. This is reflected in the two KBAC results, with the result for higher risk being highly significant and the result for lower risk being zero.

Meanwhile, for the other genes on the right (both SEPT8 and the shorter gene (CCNI2) which overlaps SEPT8), the only effect that any of the markers has is a subtle effect in the direction of lower risk. This is also reflected in the KBAC results, with the results for lower risk being more significant than the results for higher risk, with the higher-risk result for SEPT8 being essentially zero.

Direction of Effect for Generalized SKAT

Now, let’s take a look at the Generalized SKAT results with different values of Rho, along with the SKAT-O results. You will remember that Burden testing (Rho = 1) is said to be more sensitive to marker effects when they are all in the same direction, while original SKAT (Rho = 0) is said to be sensitive to marker effects even if some markers have effects in the opposite direction from the effects of other markers.

  • Go to the Plot of Column -log10 SKAT P-Value from Gen. SKAT test with Rho = 1 and start with the same procedure. Click the Plot button in the upper-left corner of the plot window.
  • Click the Project button on the left side of the Add Data Sources window.
  • Choose Association Tests (Additive Model) and check Corr/Trend R.
  • Click Plot & Close. The plot should look like Figure 6-7.
Upper Plot: Corr/Trend R for Individual SNP's; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red
Figure 6-7: Upper Plot: Corr/Trend R for Individual SNP’s; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red
  • Now, as we have done in an earlier plot, enter SLC2A12 into the Genomic Location Bar near the top of the GenomeBrowse window, and click on the first SLC2A12 entry in the dropdown. The result should be like Figure 6-8.
Close-up for Gene SLC2A12--Upper Plot: Corr/Trend R for Individual SNP's; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red
Figure 6-8: Close-up for Gene SLC2A12–Upper Plot: Corr/Trend R for Individual SNP’s; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red

We see that while a couple of SNPs have a subtle effect in the direction of lower risk, most SNPs have somewhat varying degrees of, but significant, effects in the same direction as each other, that of higher risk. We see that Burden testing (Rho = 1), as does Generalized SKAT with Rho = .25, comes out somewhat more significant here.

  • Now, enter ELOVL4 into the Genomic Location Bar near the top of the GenomeBrowse window, and click on the first ELOVL4 entry in the dropdown. The result should be like Figure 6-9.
Close-up for Gene ELOVL4--Upper Plot: Corr/Trend R for Individual SNP's; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red
Figure 6-9: Close-up for Gene ELOVL4–Upper Plot: Corr/Trend R for Individual SNP’s; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red

Here, there are two SNPs with a slight effect in the direction of lower risk, but one SNP with a greater effect in the direction of higher risk. We see that these are different enough effects and effect directions for the original SKAT test (Rho = 0) to find more significance for this gene than any other SKAT-related test.

  • Finally, go to gene WDR46 in the same manner as above. The result should be like Figure 6-10.
Close-up for Gene WDR46--Upper Plot: Corr/Trend R for Individual SNP's; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red
Figure 6-10: Close-up for Gene WDR46–Upper Plot: Corr/Trend R for Individual SNP’s; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red

Here, we see an actual instance of a combination of SNP effects where collapsing using a Generalized SKAT test with Rho = .25 works better than collapsing using either a Burden test (Rho = 1) or an original SKAT test (Rho = 0).

Note that in all cases mentioned above, the SKAT-O result consistently tracks the best Generalized SKAT result, with each result having a SKAT-O -log10 P that is about 0.6 less than that of the best Generalized SKAT -log10 P for that gene.

Conclusion

There are unlimited possible workflows for analysis of sequence data. This tutorial demonstrates a few of the tools available for variant filtering, collapsing, and analysis. Other filtering tools are available in SVS, and we are in the process of developing more. If there is any useful information contained within the plot-viewer annotation data sources, we can create a script to filter variants based on that data. There are numerous annotation sources available in SVS for filtering of variants, and users may develop their own annotations as well. If you have suggestions for improvements to the program, or if you have any questions about how to use any of these functions, please contact us.

References

[Cohen2004]Cohen J. C., Kiss R. S., Pertsemlidis A., Marcel Y. L., McPherson R., Hobbs H. H. (2004), ‘Multiple rare alleles contribute to low plasma levels of HDL cholesterol.’ Science 305 (5684):869-72.
[Lee2012]Lee S, et al (2012) ‘Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies’ Am J Hum Genet 91:224–237
[LeeWuLin2012]Lee S, Wu MC, Lin X (2012) ‘Optimal tests for rare variant effects in sequencing association studies’ Biostatistics 13, 4, pp. 762-775
[Liu2009]Liu H, Tang Y, Zhang HH (2009) ‘A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables’ Computational Statistics and Data Analysis 53 (2009) 853-856
[Morgenthaler2007]Morgenthaler S, Thilly WG (2007). ‘A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST).’ Mutat Res. 615(1-2):28-56.
[Li2008]Li B, Leal S (2008). ‘Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data’ Am J Hum Genet 83:311-321.
[Liu2010]Liu D, Leal S (2010). ‘A Novel Adaptive Method for the Analysis of Next-Generation Sequencing Data to Detect Complex Trait Associations with Rare Variants Due to Gene Main Effects and Interactions’ PLoS Genet 6(10): e1001156. doi:10.1371/journal.pgen.1001156.
Updated on March 22, 2021

Was this article helpful?

Related Articles

Leave a Comment