## Welcome to the SVS Regression with Covariates Tutorial!

**Updated**: January 28, 2021

**Level**: Intermediate

**Version**: 8.8.3 or higher

**Packages**: SNP Analysis, CNV Analysis, RNA-Seq Analysis, Power Seat

This tutorial provides an in-depth look at the SVS 8 Regression Module. It will cover controlling for confounding variables, model comparison, and regression on interactions. The spreadsheet of genotypic data included in the project file contains PCA-corrected numeric SNP columns for an additive model. The phenotype information is simulated.

If you are interested in learning how to set up a linear regression model, please read our blog post on the topic. (But note that now, contrary to what this blog post says, regression analysis in SVS may also be performed directly on genotypic spreadsheet columns.)

**Requirements**

To complete this tutorial you will need to download and unzip the following file, which includes a Golden Helix Project File.

Included in the above ZIP file:

**Regression_Tutorial** – Contains an SVS Project containing mapped PCA-corrected genotype information and a phenotype spreadsheet.

We hope you enjoy the experience and look forward to your feedback.

## 1. Overview of the Project

Regression analysis is a useful and necessary tool in any researcher’s toolbox. This tutorial provides several workflows to showcase different ways to use regression analysis in a genotypic model assessment setting. Also included is an overview of how to interpret the Regression Results output.

### Data preparation

Open the SVS Project available for download on the first page of this tutorial. You should see two spreadsheets: a phenotype spreadsheet (**HM_Sim_Pheno…**) and a marker-mapped genotype spreadsheet (**PCA-Corrected HM_500K_Geno…**). The genotype spreadsheet contains numeric values from Chromosome 1 that have been corrected for 3 principal components. This data comes from HapMap samples and is thus tainted by population stratification.

- Open the **HM_Sim_Pheno Dataset – Sheet 1**. You will need to join this spreadsheet with the genotype spreadsheet in order to perform regression.
- Choose **File > Join or Merge Spreadsheets**. Select the **PCA_Corrected HM_500K_Geno – Sheet 1** and click **OK**.
- After **New dataset name:** type **Pheno + Geno** and under **Spreadsheet as Child of** choose **Current spreadsheet**. Click **OK**.
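Conceptually, the join pairs each sample's phenotype row with its genotype row. The sketch below illustrates the idea with plain Python dictionaries keyed by sample ID. The sample IDs, column names, and values are made up for illustration, and the inner-join behavior (keeping only samples present in both tables) is an assumption, not a description of what SVS does internally.

```python
def join_on_sample(pheno, geno):
    """Inner-join two {sample_id: {column: value}} tables on sample ID."""
    joined = {}
    for sample_id, pheno_row in pheno.items():
        if sample_id in geno:
            # Phenotype columns come first, followed by genotype columns.
            joined[sample_id] = {**pheno_row, **geno[sample_id]}
    return joined


# Hypothetical phenotype and PCA-corrected genotype tables.
pheno = {
    "S1": {"Case/Control": 1, "SBP": 142.0},
    "S2": {"Case/Control": 0, "SBP": 118.0},
    "S3": {"Case/Control": 0, "SBP": 121.0},
}
geno = {
    "S1": {"SNP_1": 0.12},
    "S2": {"SNP_1": -0.40},
}

merged = join_on_sample(pheno, geno)
print(sorted(merged))  # S3 has no genotype row, so it is dropped
```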

Now you should have the phenotypic and genotypic information in the same spreadsheet. First, we will use the phenotypic and genotypic information to predict case/control status in the full model. SVS will recognize case/control status as a binary dependent variable and automatically perform logistic regression. Using a continuous dependent variable does not change the model building steps; SVS would perform linear regression instead of logistic regression and the output would change accordingly.

**Note**

This tutorial demonstrates regression analysis on numeric spreadsheet columns. The merged spreadsheet for this tutorial contains miscellaneous numeric variables as well as the numeric data that resulted from applying the PCA correction to the original genotypic data.

However, regression analysis in SVS may also be performed directly on genotypic spreadsheet columns. To do so, select **Genotype > Genotypic Regression Analysis**. In addition to the two tabs for setting regression and output parameters which are demonstrated later in this tutorial, there is a third tab which will allow setting of genotype-related parameters. These include which genotype model to use and how to deal with missing genotype data.

Currently, multivariate regression (more than one dependent variable in the model) is not supported. A single dependent column must be indicated, and it must contain real, integer, or binary values. If you are interested in building a model to predict more than one variable, we recommend performing a separate regression for each dependent variable. A categorical variable must be converted into a binary- or integer-valued variable if you wish to make it the dependent variable in the model.

## 2. Controlling For Confounding Variables

If you build a model and the response is highly correlated with one or more of the predictors (or confounded by one or more variables), these predictors can suppress possible real associations. In the regression model we will be using, we will test whether the additive genetic model covariates predict case/control status, which is known to be correlated with blood pressure. You will see that the –log10 P-values for the relationship between the two blood pressure phenotypes (SBP and DBP) and case status are very high, so blood pressure could account for other apparently significant associations.

### Full Model Only

First we will perform regression on every column using a model with an intercept and one predictor, or covariate, represented by each (numeric) column other than the column assigned as the dependent variable. This technique computes the significance of that predictor using the full model only (where what we think of as the “reduced model” consists of only the intercept). This technique will individually assess the relationship between the dependent variable and the covariate (predictor).

- Open the **Pheno + Geno – Sheet 1** spreadsheet. First you must indicate a dependent variable. To do this, left-click once on the **Case/Control** column header (the column should turn magenta). Choose **Numeric > Numeric Regression Analysis**.
- Select **Regress on each of the 40095 numeric columns**.
- Open the **Output Parameters** tab and check **Output data for P-P/Q-Q plots**. This option includes several extra columns in the resulting spreadsheet, including a –log10 P column, which is useful for plotting.
- Confirm that your options match **Figure 1** and **Figure 2** and click **Run**.

After the regression runs, a Regression Results spreadsheet will appear. Because we will be doing several regression analyses, we will rename each node appropriately.

- From the **Project Navigator**, right-click on the **Regression Results** node and select **Rename Node**.
- Rename **Regression Results** to **Regression Results – All Columns**.

Open the **Regression Results – All Columns** spreadsheet again and take a look at the –log10 Full-Model P column. The first row represents the result of regressing the dependent variable on the blood pressure covariate (SBP), and, as you can see, it has a very large –log10 P value. Scroll over to the Odds Ratio column (column 10) and you will see that the odds of being a case after a 1-unit increase in SBP or DBP are about 1.1 times the odds before the increase. Because odds ratios compound multiplicatively, a 5-unit increase in blood pressure would multiply the odds of that individual being a case by about 1.1^5 ≈ 1.6. On closer inspection, high blood pressure is very common among cases and is confounding the results. A plot of the results will allow for comparison with the model that accounts for SBP, as we will see later.
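Odds ratios from logistic regression apply per unit of the predictor and compound multiplicatively over larger changes. As a small worked example, assuming the per-unit odds ratio of about 1.1 quoted above (the function name is hypothetical):

```python
def odds_multiplier(odds_ratio_per_unit, units):
    """Factor by which the odds change after `units` unit increases of a predictor."""
    return odds_ratio_per_unit ** units


five_unit_change = odds_multiplier(1.1, 5)
print(round(five_unit_change, 2))  # 1.61
```

So a 5-unit increase corresponds to roughly a 1.6-fold change in the odds, not a 5-fold or 5.5-fold change.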

- In the **Regression Results – All Columns** spreadsheet, right-click on the **–log10 Full-Model P** column (column 2) and choose **Plot Variable in GenomeBrowse**. Zoom into Chromosome 1 by double-clicking on the cytoband track and the following plot (**Figure 3**) appears.

Notice that there are two areas of possible interest. Keep these in mind when we look at the p-values after correcting for SBP.

### Full vs. Reduced Model

The next step is to correct for the one confounding variable (SBP) and compare the results to those from using the Full Model only. Correcting for the confounding variable adds it to both the reduced and full models of the regressions on all other covariates. For each of the other covariates, a regression will take place with the full model containing the intercept, the SBP predictor and that covariate and the reduced model containing just the intercept and SBP. The comparison of the results between the full and reduced models allows for the individual assessment of each covariate after accounting for SBP.
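One common way to compare nested models like these is a likelihood-ratio test. The sketch below assumes you already have the log-likelihoods of the fitted full and reduced models and that the full model adds exactly one covariate. The log-likelihood values are hypothetical, and this is not confirmed to be the exact computation SVS performs.

```python
import math


def lrt_p_value_one_df(loglik_full, loglik_reduced):
    """Likelihood-ratio test p-value when the full model adds one covariate."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    if statistic <= 0:
        return 1.0
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods: the extra covariate improves the fit by 5 units.
p_value = lrt_p_value_one_df(-120.0, -125.0)
neg_log10_p = -math.log10(p_value)
print(p_value, neg_log10_p)
```

The `neg_log10_p` value is the kind of quantity plotted on the y-axis in the genome browser views used throughout this tutorial.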

- Open the **Pheno + Geno – Sheet 1** spreadsheet. The **Case/Control** column should still be selected as the dependent variable. Select **Numeric > Numeric Regression Analysis** to bring up the Regression Analysis window.
- This time choose **Correct for covariate(s)** underneath **Regress on each of the 40095 numeric columns**. The **Reduced Model Covariates** section should activate.
- Next to the **Reduced Model Covariates** box, click **Add Covariate**. From the resulting list check **SBP** and click **Add**. The rest of the options should have remained the same from the last analysis. Confirm that your parameters match those in **Figure 4** and click **Run**.
- From the **Project Navigator** rename the **Regression Results** to **Regression Results – Corrected for SBP**.

Notice in the resulting spreadsheet that since SBP has now been used as a reduced-model covariate, it is no longer listed in the results spreadsheet as a predictor (full-model covariate).

Next, add these results to our first plot to compare the two analyses.

- Open the **Plot of Column -log10 Full-Model P from Regression Results – All Columns** plot.
- Click the top **-log10 Full-Model P** node in the Plot Tree, then in the Controls window on the **Add** tab click **Add (Plot) Item(s)**.
- From the **Add Data Sources** dialog, click the **Project** button (if it is not already selected), select the **Regression Results – Corrected for SBP** spreadsheet and click **-log10 FvR Model P**. The **Add Data Sources** dialog should look like **Figure 5**. Press **Plot & Close**.
- Now change the color of the new data points by clicking the **-log10 FvR Model P** node and under the **Style** tab click the blue box to the right of *Style:* and change it to green.

Now, in **Figure 6**, we see only one area of interest in green, compared to the two blue regions that showed high –log10 P values. The region that is no longer significant is probably also associated with high blood pressure, so its apparent association in the previous analysis was driven by confounding. The region that still shows an association has even larger –log10 P values after correcting for this confounding variable.

## 3. Accounting for SNP Correlation

If the regression analysis is to use the entire genome as predictors, then it is recommended to select the “Use a moving window of regressors” option. A moving window incorporates consecutive SNPs, which might be in high linkage disequilibrium, providing information not captured with single variable regression. Consecutive SNPs can also be incorporated by detecting Haplotype blocks to use for regression, or by using Haplotype Trend Regression.

For this analysis, we will use a dynamic moving window the size of which is determined according to a fixed number of base pairs.

### Using a Moving Window

As SBP is a known confounding variable, it is included as a reduced model covariate for this analysis.

- Open the **Pheno + Geno – Sheet 1** spreadsheet. The **Case/Control** column should still be selected as the dependent variable. Select **Numeric > Numeric Regression Analysis** to bring up the Regression window.
- This time choose the **Regress on a moving window with parameters:** radio button under **Selection Parameters**. Then choose **Dynamic window over 40083 mapped numeric columns** with size **10000** base pairs and with **max markers: 20** checked.
- Check **Correct for covariate(s)** under **Size: ….** and add **SBP** as a **Reduced Model Covariate** in the same manner as before. Confirm that your options match those in **Figure 7** and click **Run**.
- From the **Project Navigator** rename the **Regression Results** to **Regression Results – Moving Window** in the same way as before.
- In the **Regression Results – Moving Window** spreadsheet, right-click on the **–log10 FvR Model P** column (column 2) and choose **Plot Variable in GenomeBrowse**. Then once again zoom into Chromosome 1.
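To make the window parameters concrete, here is a minimal sketch of one plausible reading of a dynamic window: starting at each marker, take consecutive markers that lie within a fixed number of base pairs of the starting marker, capped at a maximum marker count. How SVS actually advances and bounds its windows is not described here, so the function, its name, and the marker positions are all assumptions for illustration.

```python
def dynamic_windows(positions, window_bp=10000, max_markers=20):
    """Return (start, end) index pairs, end exclusive, one window per start marker.

    `positions` must be sorted base-pair positions on a single chromosome.
    """
    windows = []
    for start in range(len(positions)):
        end = start + 1
        # Extend the window while the next marker is close enough
        # and the marker cap has not been reached.
        while (
            end < len(positions)
            and end - start < max_markers
            and positions[end] - positions[start] <= window_bp
        ):
            end += 1
        windows.append((start, end))
    return windows


# Hypothetical marker positions (base pairs).
positions = [1000, 4000, 9000, 15000, 40000]
windows = dynamic_windows(positions)
print(windows)  # [(0, 3), (1, 3), (2, 4), (3, 4), (4, 5)]
```

Each window's markers would then enter one regression together as full-model covariates, with SBP in the reduced model.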

As you can see from **Figure 8**, this plot is similar to the previous corrected regression. However, we see that the peak around cytoband 1q32.2 has a maximum –log10 P-value of around 8 on the y-axis, compared to before when it had a maximum of more than 10.

## 4. Finding SNP/Environment Interactions

SVS may also be used to determine if there is an interaction between a covariate and any other covariates. For this analysis, we will test to see if weight interacts with any SNPs to effect a change in our dependent variable. We will also simultaneously correct for SBP, as it is a known confounding variable.

- Open the **Pheno + Geno – Sheet 1** spreadsheet. The **Case/Control** column should still be selected as the dependent variable. Select **Numeric > Numeric Regression Analysis** to bring up the Regression window.
- This time choose the **Regress on covariate-column interactions (on 40095 numeric cols)** radio button under **Selection Parameters**. The full-model section on the upper right will change its name to **Covariate-Column Interactions (Full Model)**. In this section, press the **Add Col Interaction** button, select the covariate **Weight (Lbs)**, and press **Add**.
- Under **Regress on covariate-column interactions (on 40095 numeric cols)**, select **Correct for additional covariate(s)**. Add **SBP** as a **Reduced Model Covariate** in the same manner as before.

Notice, in **Figure 9**, that the model actually being used is explained by the display in the upper-right section and the display in the lower-right section. (You may need to widen the options window to see this displayed completely.) The one full-model covariate in each regression is the interaction between weight and the data of the current numeric column. The reduced model corrects not only for SBP, but for both the weight and the current numeric column as individual covariates.
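Written out, the models described above take roughly the following form for the current numeric column $x_j$. The logistic form is assumed because the dependent variable is binary, and the coefficient symbols are ours, not taken from the SVS dialog.

```latex
\begin{aligned}
\text{Full:}\quad
  \operatorname{logit} P(\text{Case}) &= \beta_0 + \beta_1\,\text{SBP} + \beta_2\,\text{Weight} + \beta_3\,x_j + \beta_4\,(\text{Weight}\cdot x_j) \\
\text{Reduced:}\quad
  \operatorname{logit} P(\text{Case}) &= \beta_0 + \beta_1\,\text{SBP} + \beta_2\,\text{Weight} + \beta_3\,x_j \\
H_0 &: \beta_4 = 0 \qquad H_a : \beta_4 \neq 0
\end{aligned}
```

The full-vs-reduced test therefore isolates the interaction term: a small p-value indicates that the effect of the column on case status depends on weight.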

- Next, open the **Output Parameters** tab and check **Bonferroni adjustment (on N covariates)** and **False Discovery Rate (FDR)**.
- Finally, confirm that your options match those in **Figure 9** and **Figure 10** and click **Run**.
- From the **Project Navigator** rename the **Regression Results** to **Regression Results – CC Interactions** in the same way as before.
- In the **Regression Results – CC Interactions** spreadsheet, right-click on the **-log10 FvR Model P** column (column 2) and choose **Plot Variable in GenomeBrowse**. Then once again zoom into Chromosome 1.

As can be seen from **Figure 11**, we do have some (mild) interactions, with a -log10 P-value of up to a little more than 5.

We may ask: Are any of these interactions significant on a chromosome-wide or genome-wide scale?

- In the **Regression Results – CC Interactions** spreadsheet, right-click on the **FvR Model P-Value** column (column 1) and select **Sort Ascending**.
- Observe the results for **SNP_A-1795780** (row 10), which had the best full-vs-reduced-model p-value, namely $6.541 \times 10^{-6}$. This would be significant by itself, and it seems like it could be at least chromosome-wide significant. However, to know for sure, let's check the results of the multiple-test correction procedures.
- Scroll to the right until you reach the **Bonferroni P** and **FDR** columns (columns 26 and 27). We see that for **SNP_A-1795780**, both the Bonferroni-adjusted p-value and the False Discovery Rate are 0.262, well above the 0.05 threshold we would need for chromosome-wide (or genome-wide) significance.
- Right-click on the **FDR** column and select **Sort Ascending**. We see that no other marker interaction has an FDR even as small as that for the interaction with **SNP_A-1795780**.

Therefore, we have not found any interaction with **Weight (Lbs)** that was significant on either a chromosome-wide or a genome-wide scale.
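The two corrections can be sketched in a few lines of Python. This assumes the Bonferroni adjustment multiplies each p-value by the number of tests and that the FDR column follows the Benjamini-Hochberg procedure. Both are standard definitions, but whether SVS computes them exactly this way, and with N = 40095, is an assumption based on the column count shown in the dialog.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * N, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Best interaction p-value from the tutorial, with an assumed N of 40095 tests.
best_bonferroni = min(1.0, 6.541e-6 * 40095)
print(round(best_bonferroni, 3))  # 0.262

# Small hypothetical example of both procedures.
example = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(example))
print(benjamini_hochberg(example))
```

Under Benjamini-Hochberg, the smallest p-value is adjusted to at most p × N, the same value Bonferroni gives it. This is consistent with the two columns agreeing at 0.262 for the top marker.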

## 5. Using the Regression Module for Model Comparison

The SVS regression module can also be used for model selection. This is a more classical way of thinking about regression, namely determining the added effect of individual additional covariates using a full vs. reduced model test.

- Open the **Pheno + Geno – Sheet 1** spreadsheet and select **Numeric > Numeric Regression Analysis** to bring up the **Regression window**.
- This time choose **Perform single regression with selected covariates** under **Selection Parameters**.

First, we note that for all regressions we have demonstrated thus far, any covariates selected or shown in the **Reduced Model Covariates** box are actually included in both the reduced and the full model of any regression, while the covariates selected or shown in the **Full Model Covariates** box are only included in the full model. Therefore, running a full vs. reduced model actually tests the additional effects of the **Full Model Covariates** over and above the effects of the **Reduced Model Covariates**.

For this example, we will do exactly one regression to find the additional effect that change in diastolic blood pressure (**Chng in Dbp**) has over and above the effects of covariates **SBP** and **Previous Event**.

- Check **Correct for covariate(s)** and then under **Reduced Model Covariates** choose **Add Covariate**. Check both **SBP** and **Previous Event** and then click **Add**.
- Next, under **Full Model Covariates**, add **Chng in Dbp** in the same manner as the **Reduced Model Covariates** were added. Make sure the window looks like **Figure 12** and click **Run**.

This will test whether **Chng in Dbp** improves the fit of the model by explaining more of the dependent variable response after accounting for **SBP** and **Previous Event**. More specifically, the following models are used in the test:

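**Full Model:**

$$\operatorname{logit} P(\text{Case}) = \beta_0 + \beta_1\,\text{SBP} + \beta_2\,\text{Previous Event} + \beta_3\,\text{Chng in Dbp}$$

**Reduced Model:**

$$\operatorname{logit} P(\text{Case}) = \beta_0 + \beta_1\,\text{SBP} + \beta_2\,\text{Previous Event}$$

And the hypotheses are **H0**:

$$\beta_3 = 0$$

and **Ha**:

$$\beta_3 \neq 0$$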

The **Regression Statistics Viewer** (**Figure 13**) shows a P-value of about 0.013 for the full vs. reduced model. Because this is below the conventional 0.05 threshold, it suggests that **Chng in Dbp** is useful in predicting **Case/Control** status after accounting for **SBP** and **Previous Event**. For more information about the additional outputs in the **Regression Statistics Viewer**, see the Regression Analysis chapter in the SVS manual.