
SVS Regression with Covariates Tutorial

Welcome to the SVS Regression with Covariates Tutorial!


Updated: January 28, 2021

Level: Intermediate

Version: 8.8.3 or higher

Packages: SNP Analysis, CNV Analysis, RNA-Seq Analysis, Power Seat

This tutorial provides an in-depth look at the SVS 8 Regression Module. It will cover controlling for confounding variables, model comparison, and regression on interactions. The spreadsheet of genotypic data included in the project file contains PCA-corrected numeric SNP columns for an additive model. The phenotype information is simulated.

If you are interested in learning how to set up a linear regression model, please read our blog post on the topic. (But note that now, contrary to what this blog post says, regression analysis in SVS may also be performed directly on genotypic spreadsheet columns.)

Requirements

To complete this tutorial you will need to download and unzip the following file, which includes a Golden Helix Project File.

SVS_Regression_Tutorial.zip

Included in the above ZIP file:

Regression_Tutorial – An SVS project containing mapped, PCA-corrected genotype information and a phenotype spreadsheet.

We hope you enjoy the experience and look forward to your feedback.

1. Overview of the Project

Regression analysis is a useful and necessary tool in any researcher’s toolbox. This tutorial provides several workflows to showcase different ways to use regression analysis in a genotypic model assessment setting. Also included is an overview of how to interpret the Regression Results output.

Data preparation

Open the SVS Project available for download on the first page of this tutorial. You should see two spreadsheets: a phenotype spreadsheet (HM_Sim_Pheno…) and a marker-mapped genotype spreadsheet (PCA-Corrected HM_500K_Geno…). The genotype spreadsheet contains numeric values from Chromosome 1 that have been corrected for 3 principal components. The data come from HapMap samples and are therefore affected by population stratification, which is why the PCA correction was applied.

  • Open the HM_Sim_Pheno Dataset – Sheet 1. You will need to join this spreadsheet with the genotype spreadsheet in order to perform regression.
  • Choose File > Join or Merge Spreadsheets. Select the PCA_Corrected HM_500K_Geno – Sheet 1 and click OK.
  • After New dataset name: type Pheno + Geno and under Spreadsheet as Child of choose Current spreadsheet. Click OK.

Now you should have the phenotypic and genotypic information in the same spreadsheet. First, we will use the phenotypic and genotypic information to predict case/control status in the full model. SVS will recognize case/control status as a binary dependent variable and automatically perform logistic regression. Using a continuous dependent variable does not change the model building steps; SVS would perform linear regression instead of logistic regression and the output would change accordingly.

Note

This tutorial demonstrates regression analysis on numeric spreadsheet columns. The merged spreadsheet for this tutorial contains miscellaneous numeric variables as well as the numeric data that resulted from applying the PCA correction to the original genotypic data.

However, regression analysis in SVS may also be performed directly on genotypic spreadsheet columns. To do so, select Genotype > Genotypic Regression Analysis. In addition to the two tabs for setting regression and output parameters which are demonstrated later in this tutorial, there is a third tab which will allow setting of genotype-related parameters. These include which genotype model to use and how to deal with missing genotype data.

Currently, multivariate regression (more than one dependent variable in the model) is not supported. A single dependent column must be indicated, and it must contain real, integer, or binary values. If you are interested in building a model to predict more than one variable, running a separate regression for each dependent variable is recommended. A categorical variable must be converted into a binary- or integer-valued variable if you wish to make it the dependent variable in the model.
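As a rough illustration of the conversion described in the note, the sketch below recodes a categorical column into a binary one outside of SVS. The function name `to_binary` and the labels are hypothetical; this is not an SVS function, only one simple way to do the recoding.

```python
def to_binary(values, case_label):
    """Map each category to 1 if it equals case_label, else 0.

    Missing entries (None) are kept as None so they stay missing.
    """
    return [None if v is None else int(v == case_label) for v in values]


# Hypothetical categorical phenotype column.
status = ["affected", "unaffected", None, "affected"]
print(to_binary(status, "affected"))  # [1, 0, None, 1]
```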

2. Controlling For Confounding Variables

If the response in a model is highly correlated with one or more of the predictors (or is confounded by one or more variables), those predictors can mask possible real associations. In the regression model we will be using, we will test whether the additive genetic model covariates predict case/control status, which is known to be correlated with blood pressure. You will see that the –log10 P-values for the relationship between the two blood pressure phenotypes (SBP and DBP) and case status are very high, and that blood pressure could account for other apparently significant associations.

Full Model Only

First we will perform regression on every column using a model with an intercept and one predictor, or covariate, represented by each (numeric) column other than the column assigned as the dependent variable. This technique computes the significance of that predictor using the full model only (where what we think of as the “reduced model” consists of only the intercept). This technique will individually assess the relationship between the dependent variable and the covariate (predictor).

  • Open the Pheno + Geno – Sheet 1 spreadsheet. First you must indicate a dependent variable. To do this, left-click once on the Case/Control column header (the column should turn magenta). Choose Numeric > Numeric Regression Analysis.
  • Select Regress on each of the 40095 numeric columns.
  • Open the Output Parameters tab and check Output data for P-P/Q-Q plots. This option includes several extra columns in the resulting spreadsheet, including a –log10 P column, which is useful for plotting.
  • Confirm that your options match Figure 1 and Figure 2 and click Run.
Figure 1: Regression Options Main Tab
Figure 2: Regression Options Output Tab

After the regression runs, a Regression Results spreadsheet will appear. Because we will be doing several regression analyses, we will rename each node appropriately.

  • From the Project Navigator, right click on the Regression Results node and select Rename Node.
  • Rename Regression Results to Regression Results – All Columns.

Open the Regression Results – All Columns spreadsheet again and take a look at the –log10 Full-Model P column. The first row shows the result of regressing the dependent variable on the blood pressure covariate (SBP) and, as you can see, it has a very large –log10 P value. Scroll over to the Odds Ratio column (column 10) and you will see that the odds of being a case after a 1-unit increase in SBP or DBP are about 1.1 times the odds before the increase. Because odds ratios multiply, a 5-unit increase in blood pressure corresponds to roughly a 1.6-fold increase in the odds of that individual being a case (1.1^5 ≈ 1.61). On closer inspection, high blood pressure is very common among cases and is confounding the results. A plot of the results will allow for comparison with the model that accounts for SBP, as we will see later.
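In logistic regression a per-unit odds ratio applies multiplicatively, so the effect of a multi-unit increase is the per-unit odds ratio raised to that power. The short sketch below works through the arithmetic for an odds ratio of about 1.1; the function names are illustrative and are not part of SVS.

```python
import math


def odds_multiplier(odds_ratio_per_unit, units):
    """Factor by which the odds change after `units` increase in the predictor."""
    return odds_ratio_per_unit ** units


def coefficient_from_odds_ratio(odds_ratio_per_unit):
    """Logistic regression coefficient (log-odds per unit) implied by an odds ratio."""
    return math.log(odds_ratio_per_unit)


# An odds ratio of about 1.1 per unit, applied over a 5-unit increase.
print(round(odds_multiplier(1.1, 5), 2))  # 1.61
```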

  • In the Regression Results – All Columns spreadsheet, right-click on the –log10 Full-Model P column (column 2) and choose Plot Variable in GenomeBrowse. Zoom into Chromosome 1 by double clicking on the cytoband track and the following plot (Figure 3) appears.
Figure 3: P-value plot all columns

Notice that there are two areas of possible interest. Keep these in mind when we look at the p-values after correcting for SBP.

Full vs. Reduced Model

The next step is to correct for the one confounding variable (SBP) and compare the results to those from using the Full Model only. Correcting for the confounding variable adds it to both the reduced and full models of the regressions on all other covariates. For each of the other covariates, a regression will take place with the full model containing the intercept, the SBP predictor and that covariate and the reduced model containing just the intercept and SBP. The comparison of the results between the full and reduced models allows for the individual assessment of each covariate after accounting for SBP.
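One common way to compare a full and a reduced logistic model is a likelihood ratio test. The sketch below assumes that test, with one added covariate (1 degree of freedom); whether SVS computes its full-vs-reduced p-value in exactly this way should be confirmed in the SVS manual. The log-likelihood values are made up for illustration.

```python
import math


def lrt_p_value_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test p-value when the full model adds one covariate.

    The statistic 2 * (loglik_full - loglik_reduced) is compared to a
    chi-square distribution with 1 degree of freedom, whose upper tail
    probability is erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods: reduced = intercept + SBP,
# full = intercept + SBP + one SNP column.
p = lrt_p_value_1df(loglik_reduced=-250.0, loglik_full=-240.0)
print(p, -math.log10(p))
```

A larger improvement in log-likelihood from adding the covariate gives a smaller p-value and therefore a larger –log10 P value in the plot.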

  • Open the Pheno + Geno – Sheet 1 spreadsheet. The Case/Control column should still be selected as the dependent variable. Select Numeric > Numeric Regression Analysis to bring up the Regression Analysis window.
  • This time choose Correct for covariate(s) underneath Regress on each of the 40095 numeric columns. The Reduced Model Covariates section should activate.
Figure 4: Regression window
  • Next to the Reduced Model Covariates box, click Add Covariate. From the resulting list check SBP and click Add. The rest of the options should have remained the same from the last analysis. Confirm that your parameters match those in Figure 4 and click Run.
  • From the Project Navigator rename the Regression Results to Regression Results – Corrected for SBP.

Notice in the resulting spreadsheet that since SBP has now been used as a reduced-model covariate, it is no longer listed in the results spreadsheet as a predictor (full-model covariate).

Next, add these results to our first plot to compare the two analyses.

  • Open the Plot of Column -log10 Full-Model P from Regression Results – All Columns plot.
  • Click the top -log10 Full-Model P node in the Plot Tree, then in the Controls window on the Add tab click Add (Plot) Item(s).
  • From the Add Data Sources dialog, click the Project button (if it is not already selected), select the Regression Results – Corrected for SBP spreadsheet and click -log10 FvR Model P. The Add Data Sources dialog should look like Figure 5. Press Plot & Close.
Figure 5. Add Data Sources window
  • Now change the color of the new data points by clicking the -log10 FvR Model P node and under the Style tab click the blue box to the right of Style: and change it to green.

Now, in Figure 6, we see only one area of interest in green, compared to the two blue regions that showed high –log10 P values. The genetic region that was significant only in the previous analysis is probably associated with high blood pressure, so its apparent association with case status was a result of confounding. The region that still shows an association has even larger –log10 P values after correcting for this confounding variable.

Figure 6: P-value plot comparison

3. Accounting for SNP Correlation

If the regression analysis is to use the entire genome as predictors, then it is recommended to select the “Use a moving window of regressors” option. A moving window incorporates consecutive SNPs, which might be in high linkage disequilibrium, providing information not captured with single variable regression. Consecutive SNPs can also be incorporated by detecting Haplotype blocks to use for regression, or by using Haplotype Trend Regression.

For this analysis, we will use a dynamic moving window whose size is set by a fixed number of base pairs.
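To make the idea of a base-pair-sized window concrete, here is a minimal sketch of one plausible way to form such windows from sorted marker positions, with an optional cap on markers per window. The exact windowing rule SVS applies is not described here, so the anchoring of each window at its first marker, the function name, and the positions are all assumptions.

```python
def dynamic_windows(positions, size_bp, max_markers=None):
    """Return (start, end) index pairs, inclusive, one window per starting marker.

    positions must be sorted ascending. A window anchored at marker `start`
    grows while the next marker lies within size_bp of positions[start]
    and the optional max_markers cap has not been reached.
    """
    windows = []
    n = len(positions)
    for start in range(n):
        end = start
        while (end + 1 < n
               and positions[end + 1] - positions[start] <= size_bp
               and (max_markers is None or end + 1 - start < max_markers)):
            end += 1
        windows.append((start, end))
    return windows


# Hypothetical marker positions in base pairs.
positions = [1000, 4000, 9000, 12000, 30000]
print(dynamic_windows(positions, 10000))
# [(0, 2), (1, 3), (2, 3), (3, 3), (4, 4)]
```

Markers in dense regions end up sharing windows with several neighbors, while isolated markers are regressed on their own.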

Using a Moving Window

As SBP is a known confounding variable, it is included as a reduced model covariate for this analysis.

  • Open the Pheno + Geno – Sheet 1 spreadsheet. The Case/Control column should still be selected as the dependent variable. Select Numeric > Numeric Regression Analysis to bring up the Regression window.
  • This time choose the Regress on a moving window with parameters: radio button under Selection Parameters. Then choose Dynamic window over 40083 mapped numeric columns with size 10000 base pairs and with max markers: 20 checked.
  • Check Correct for covariate(s) under Size: …. and add SBP as a Reduced Model Covariate in the same manner as before. Confirm that your options match those in Figure 7 and click Run.
Figure 7: Regression Options (Moving Window)
  • From the Project Navigator rename the Regression Results to Regression Results – Moving Window in the same way as before.
  • In the Regression Results – Moving Window spreadsheet, right-click on the –log10 FvR Model P column (column 2) and choose Plot Variable in GenomeBrowse. Then once again zoom into Chromosome 1.

As you can see from Figure 8, this plot is similar to the previous corrected regression. However, we see that the peak around cytoband 1q32.2 has a maximum –log10 P-value of around 8 on the y-axis, compared to before when it had a maximum of more than 10.

Figure 8: P-value plot (Moving Window)

4. Finding SNP/Environment Interactions

SVS may also be used to determine if there is an interaction between a covariate and any other covariates. For this analysis, we will test to see if weight interacts with any SNPs to effect a change in our dependent variable. We will also simultaneously correct for SBP, as it is a known confounding variable.

  • Open the Pheno + Geno – Sheet 1 spreadsheet. The Case/Control column should still be selected as the dependent variable. Select Numeric > Numeric Regression Analysis to bring up the Regression window.
  • This time choose the Regress on covariate-column interactions (on 40095 numeric cols) radio button under Selection Parameters. The full-model section on the upper right will change its name to Covariate-Column Interactions (Full Model). In this section, press the Add Col Interaction button, select the covariate Weight (Lbs), and press Add.
  • Under Regress on covariate-column interactions (on 40095 numeric cols), select Correct for additional covariate(s). Add SBP as a Reduced Model Covariate in the same manner as before.

Notice, in Figure 9, that the model actually being used is explained by the display in the upper-right section and the display in the lower-right section. (You may need to widen the options window to see this displayed completely.) The one full-model covariate in each regression is the interaction between weight and the data of the current numeric column. The reduced model corrects not only for SBP, but for both the weight and the current numeric column as individual covariates.
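The model structure described above can be sketched as two sets of predictor rows per numeric column: the reduced model holds the intercept, SBP, weight, and the column itself, and the full model adds only their product. The function and variable names below are hypothetical and the values are invented; this only illustrates how the interaction term is formed, not how SVS stores its data.

```python
def interaction_design_rows(weight, snp, sbp):
    """Build reduced- and full-model predictor rows for one numeric column.

    Reduced row: [intercept, SBP, weight, snp]
    Full row:    reduced row plus the weight * snp interaction term
    """
    reduced = [[1.0, s, w, g] for s, w, g in zip(sbp, weight, snp)]
    full = [row + [row[2] * row[3]] for row in reduced]
    return reduced, full


# Two hypothetical samples.
reduced, full = interaction_design_rows(
    weight=[150.0, 180.0], snp=[2.0, 0.5], sbp=[120.0, 135.0]
)
print(full)  # [[1.0, 120.0, 150.0, 2.0, 300.0], [1.0, 135.0, 180.0, 0.5, 90.0]]
```

Because the full model differs from the reduced model by a single term, the full-vs-reduced test isolates the interaction effect.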

Figure 9. Regression Options Main Tab (Covariate-Column Interactions)
Figure 10: Regression Options Output Tab (Covariate-Column Interactions)
  • Next, open the Output Parameters tab and check Bonferroni adjustment (on N covariates) and False Discovery Rate (FDR).
  • Finally, confirm that your options match those in Figure 9 and Figure 10 and click Run.
  • From the Project Navigator rename the Regression Results to Regression Results – CC Interactions in the same way as before.
  • In the Regression Results – CC Interactions spreadsheet, right-click on the -log10 FvR Model P column (column 2) and choose Plot Variable in GenomeBrowse. Then once again zoom into chromosome 1.

As can be seen from Figure 11, we do have some (mild) interactions, with a -log10 P-value of up to a little more than 5.

Figure 11: P-value plot (Covariate-Column Interactions)

We may ask: Are any of these interactions significant on a chromosome-wide or genome-wide scale?

  • In the Regression Results – CC Interactions spreadsheet, right-click on the FvR Model P-Value column (column 1) and select Sort Ascending.
  • Observe the results for SNP_A-1795780 (row 10), which had the best full-vs-reduced-model p-value, namely 6.541 × 10^-6. This would be significant by itself, and it seems like it could be at least chromosome-wide significant. However, to know for sure, let’s check the results of the multiple-test correction procedures.
  • Scroll to the right until you reach the Bonferroni P and FDR columns (columns 26 and 27). We see that for SNP_A-1795780, both the Bonferroni-adjusted p-value and the False Discovery Rate are 0.262, well above the 0.05 threshold we would need for chromosome-wide (or genome-wide) significance.
  • Right-click on the FDR column and select Sort Ascending. We see that no other marker interaction has an FDR even as small as that for the interaction with SNP_A-1795780.

Therefore, we have not found any interaction with Weight (Lbs) that was significant on either a chromosome-wide or a genome-wide scale.
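The adjusted values above can be checked by hand. The sketch below applies a Bonferroni adjustment and a Benjamini-Hochberg style FDR; it assumes the 40095 numeric columns as the number of tests and assumes the Benjamini-Hochberg procedure, which may differ in detail from how SVS computes its FDR column. The function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Best p-value from the tutorial, assuming 40095 tests.
print(round(min(1.0, 6.541e-6 * 40095), 3))  # 0.262
```

For the top-ranked marker the rank is 1, so its unadjusted FDR value p × N / 1 equals its Bonferroni-adjusted p-value, which is consistent with both columns showing 0.262.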

5. Using the Regression Module for Model Comparison

The SVS regression module can also be used for model selection. This is a more classical way of thinking about regression, namely determining the added effect of individual additional covariates using a full vs. reduced model test.

  • Open the Pheno + Geno – Sheet 1 spreadsheet and select Numeric > Numeric Regression Analysis to bring up the Regression window.
  • This time choose Perform single regression with selected covariates under Selection Parameters.

First, we note that for all regressions we have demonstrated thus far, any covariates selected or shown in the Reduced Model Covariates box are actually included in both the reduced and the full model of any regression, while the covariates selected or shown in the Full Model Covariates box are only included in the full model. Therefore, running a full vs. reduced model actually tests the additional effects of the Full Model Covariates over and above the effects of the Reduced Model Covariates.

For this example, we will do exactly one regression to find the additional effect that change in diastolic blood pressure (Chng in Dbp) has over and above the effects of covariates SBP and Previous Event.

  • Check Correct for covariate(s) and then under Reduced Model Covariates choose Add Covariate. Check both SBP and Previous Event and then click Add.
  • Next, under Full Model Covariates, add Chng in Dbp in the same manner as the Reduced Model Covariates were added. Make sure the window looks like Figure 12 and click Run.
Figure 12. Regression Options (perform regression with selected covariates only)

This will test whether Chng in Dbp improves the fit of the model by explaining more of the dependent variable response after accounting for SBP and Previous Event. More specifically, the following models are used in the test:

Full Model:

\text{Case/Control}_i \sim \beta_0 + \beta_1*(\text{SBP}) + \beta_2*(\text{Previous Event}) + \beta_3*(\text{Chng in Dbp}) + \epsilon_i

Reduced Model:

\text{Case/Control}_i \sim \beta_0 + \beta_1*(\text{SBP}) + \beta_2*(\text{Previous Event}) + \epsilon_i

And the hypotheses are H_0: \beta_3 = 0 and H_a: \beta_3 \neq 0

Figure 13: Regression Statistics Viewer

The Regression Statistics Viewer (Figure 13) shows a P-value of about 0.013 for the full vs. reduced model. At the 0.05 significance level, we reject H_0 and conclude that Chng in Dbp is useful in predicting Case/Control status after accounting for SBP and Previous Event. For more information about the additional outputs in the Regression Statistics Viewer, see the Regression Analysis chapter in the SVS manual.

Updated on March 22, 2021
