
SVS Genomic Prediction with K-Fold Tutorial

Welcome to the SVS Genomic Prediction with K-Fold Tutorial!


Updated: March 8, 2021

Level: Advanced

Version: 8.7.0 or higher

Product: SVS

K-Fold cross validation is a model validation technique that can be used to assess how well a model can predict a phenotype for a given independent dataset. To evaluate our model, we use samples for which we have both genotype and phenotype data.

To prepare for k repetitions of the cross-validation procedure (“K-Fold”), SVS partitions our data into k subsets of samples, which we shall call subsamples. During each repetition, or “fold”, one subsample is selected to be the validation set while the model is fitted to the other k-1 subsamples. (In a sense, this samples k times “without replacement”.) Using the combined results from the k repetitions we can analyze how well the model was able to predict the phenotypes for our dataset and how well it can predict an independent data set.

Subsamples are sampled with either simple random sampling or, if there is a subpopulation grouping present, stratified random sampling.

Several variations are available for this procedure. Along with selecting model parameters and the value of k, it is possible to select how often to repeat the above procedure, starting with partitioning the data into k subsets. Additionally, an option is available that is equivalent to sampling “with replacement” for the validation set.
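To make the procedure concrete, here is a minimal Python sketch of the partition-and-hold-out loop described above. This is illustrative only, not SVS's internal implementation:

```python
import random

def k_fold_splits(sample_ids, k, seed=0):
    """Partition samples into k disjoint subsamples, then yield one
    (training, validation) pair per fold ("without replacement")."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k roughly equal subsamples
    for held_out in range(k):
        validation = folds[held_out]
        training = [s for f in range(k) if f != held_out for s in folds[f]]
        yield training, validation

# With 10 samples and k=5, each sample lands in exactly one validation set.
splits = list(k_fold_splits(range(10), k=5))
```

Each sample is used for validation exactly once across the k folds, and for training in the other k-1 folds.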

Using a Soybean dataset, this tutorial will lead you step-by-step through the workflow for determining whether a prediction model is appropriate.

Note

This tutorial will not cover the data importing and quality control procedures that were used. To learn more about the various quality control tools available in SNP & Variation Suite (SVS), please refer to the quality control section of the manual.

Requirements

To complete this tutorial you will need to download the following:

SVS_Genomic_Prediction_With_K_Fold.zip

We hope you enjoy the experience and look forward to your feedback.

1. Overview of the Project

  • From the main SVS welcome screen, select File > Open Project. Navigate to the Genomic Prediction with K-Fold Tutorial folder you just downloaded and select the Genomic Prediction with K-Fold Tutorial.ghp file. Click Open to finish. You will see the project navigator view (Figure 1-1).
Project Navigator View
Figure 1-1: Project Navigator View

The main data spreadsheet, named SimPhens + SoyBeanGWAS – Sheet 1, which has mapped genotypic information and four phenotype columns, will be used to complete the analysis. (Figure 1-2.)

Soybean data set with genotypic and phenotypic data.
Figure 1-2: Soybean Spreadsheet

For this tutorial, we will predict the second phenotype, Phen2, and use the fourth phenotype, Phen4, for our sampling technique.

The project already contains the analysis output from the Bayes C-pi portion of the K-Fold run. In the next section we will complete the GBLUP portion of the analysis and discuss the output from both tools.

NOTE: Due to its iterative nature, the Bayes C-pi algorithm can take considerably more time to finish than GBLUP. The total time for both tools to run consecutively in this project can be about 3 hours, depending on your computer's specifications.

2. Performing K-Fold Cross Validation

Running GBLUP Analysis

  • To run K-Fold for GBLUP, open SimPhens + SoyBeanGWAS – Sheet 1 and select Genotype > K-Fold Cross Validation (for Genomic Prediction).
  • A second spreadsheet will automatically be created, SimPhens + SoyBeanGWAS Numeric Genotypes Additive (Major/Minor used). This spreadsheet contains numerical representations of the genotypes.
  • For this tutorial, choose Genomic Best Linear Unbiased Predictors (GBLUP) under Method(s). Check Stratify Folds by and select Phen4. Under K-Fold Options set the Number of Iterations to 2.

NOTE: The options selected for the Bayes C-pi analysis are also seen in Figure 2-1 below. In particular, the Number of Iterations was set to 1000. This smaller number of iterations was selected for expediency. To obtain more accurate results, it is advised to select a larger number of iterations.

  • Click OK to begin the analysis.
K-Fold dialog with selected options
Figure 2-1: K-Fold Dialog Options

The K-Fold method will first create our subsamples using the stratified random sampling method. A subsample will be created for each of the k folds in such a way as to ensure that each stratum is equally represented.

Illustrative Example

Consider the following dataset (Figure 2-2) that is a portion of the soybean data from the tutorial project.

Stratified Sampling Example
Figure 2-2: Stratified Sampling Example

We have 60% of our samples in group 0 and 40% in group 1 based on Phen4, so for each subsample, we should expect about the same proportions of samples in each group. If we look at the red subsample set, we have a total of four samples with 50% in group 0 and 50% in group 1. (With only four samples, this is as close to the overall 60/40 split as possible.)

In our example, we selected five folds, so we created five subsamples represented by the color groups (red, yellow, green, blue and black). For each fold, we select one of the subsamples (color groups) to be the validation set and the rest of the samples to be the training set. We continue until each subsample has been the validation set for a fold.
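As an illustration (not SVS's actual code), the stratified assignment can be sketched by shuffling the samples within each stratum and then dealing them out to folds round-robin, which keeps each fold's group proportions close to the overall 60/40 split:

```python
import random
from collections import defaultdict

def stratified_folds(strata, k, seed=0):
    """Map each sample index to a fold, preserving strata proportions."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, group in enumerate(strata):
        by_group[group].append(idx)
    fold_of = {}
    for members in by_group.values():
        rng.shuffle(members)
        for j, idx in enumerate(members):
            fold_of[idx] = j % k  # deal out round-robin within the stratum
    return fold_of

# 60% of samples in group 0 and 40% in group 1, as in the example above.
strata = [0] * 12 + [1] * 8
assignment = stratified_folds(strata, k=5)
```

With 12 samples in group 0 and 8 in group 1 dealt across 5 folds, every fold receives 2-3 samples from group 0 and 1-2 from group 1, mirroring the 60/40 stratification.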

For more information about the K-Fold algorithm, see the K-Fold Cross Validation section of the SVS Manual.

Information on the GBLUP and Bayesian methods can be found in the Methods for Mixed Linear Model Analysis section of the SVS Manual.

Examining the GBLUP Results

When K-Fold Cross Validation finishes there should be results for GBLUP and Bayes C Pi for two iterations and five folds in the project. At the end of each iteration there will be several Final Results spreadsheets and a Final Results viewer, and at the end of all the iterations there will be an Iteration Summary Statistics spreadsheet and an Iteration Summary Statistics viewer.

For each GBLUP fold there will be four to five spreadsheets.

GBLUP – Fold # – Iteration # (Figure 2-3)

GBLUP Fold spreadsheet with validation set to missing
Figure 2-3: Genotype spreadsheet with validation set for fold 1 set to missing

GBLUP Genomic Relationship Matrix (Figure 2-4)

Genomic Relationship Matrix
Figure 2-4: Genomic Relationship Matrix

This will only be created for the first fold and only if a precomputed matrix was not selected in the options dialog.
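For illustration, a common way to compute an additive genomic relationship matrix is the VanRaden-style formulation sketched below, using a 0/1/2 additive genotype coding. SVS offers its own GRM options in the dialog, so treat this as an approximation of the idea rather than the exact matrix SVS produces:

```python
def genomic_relationship_matrix(genotypes):
    """VanRaden-style additive GRM from an n x m matrix of 0/1/2 genotypes."""
    n, m = len(genotypes), len(genotypes[0])
    # Per-marker allele frequencies
    p = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    denom = sum(2 * pj * (1 - pj) for pj in p)
    # Center each genotype by twice the marker's allele frequency
    w = [[genotypes[i][j] - 2 * p[j] for j in range(m)] for i in range(n)]
    return [[sum(w[i][j] * w[l][j] for j in range(m)) / denom
             for l in range(n)] for i in range(n)]

# Tiny made-up example: 3 samples x 3 markers.
G = genomic_relationship_matrix([[0, 1, 2], [2, 1, 0], [1, 1, 1]])
```

The resulting matrix is symmetric, with diagonal entries reflecting each sample's relationship to itself and off-diagonal entries reflecting pairwise genomic similarity.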

GBLUP estimates by marker – Fold # (Figure 2-5)

Estimates by Marker
Figure 2-5: Estimates by Marker

Genomic value plots from the Normalized Abs ASE column can be used to visualize the contribution of each locus to the model.

  • Right-click on the Normalized Abs ASE column and select Plot Variable in GenomeBrowse. (See Figure 2-6.) You can do this for the output from any fold or from the final results of the analysis.
Normalized Abs ASE
Figure 2-6: Contribution of Each Locus to the ASE results

GBLUP fixed effect coefficients – Fold # (Figure 2-7)

Fixed effect coefficients
Figure 2-7: Fixed effect coefficients

This spreadsheet lists the intercept and any fixed effects included in the model. For this example, we didn’t add fixed effects in the options dialog, so we only see the intercept. The Reference Covariate? value is 0 since the intercept is not categorical. (This column is used to flag categorical covariates that serve as reference covariates.)

GBLUP estimates by sample – Fold # (Figure 2-8)

Estimates by sample
Figure 2-8: Estimates by Sample

This spreadsheet contains three columns: actual phenotypes, predicted phenotypes, and random effect components (gEBV). We can visually compare the actual and predicted phenotypes by creating a scatter plot.

  • Click on the Actual Phenotype column to make it a dependent variable, then right-click on the Predicted phenotype column and select Plot X: Predicted phenotype vs Y:Dependent(s). This will create a plot under this spreadsheet called Plots from GBLUP estimates by sample – Fold # – Iteration 1 against Predicted phenotype. (Figure 2-9)
Regression plot between actual and predicted phenotypes
Figure 2-9: Scatter plot between actual and predicted phenotypes

This graph shows evidence of a positive correlation between the actual and predicted phenotypes. Below, we’ll look at the GBLUP Summary Statistics where we can look at the per fold and overall statistics to draw more conclusions on the strength of this model.
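The strength of that correlation can be quantified with Pearson's product-moment correlation coefficient, the same statistic reported in the summary statistics. A self-contained sketch (the phenotype values here are made up for illustration):

```python
import math

def pearson(actual, predicted):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (sa * sp)

# Hypothetical actual vs. predicted phenotypes showing a strong positive trend.
r = pearson([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Values near +1 indicate the predicted phenotypes track the actual phenotypes closely; values near 0 indicate little predictive relationship.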

At the end of each iteration, GBLUP will have four outputs: three spreadsheets and one results viewer.

GBLUP Final Results – Iteration # (Figure 2-10)

Compiled predicted phenotype values from each fold
Figure 2-10: Predicted Phenotypes from each fold compiled into a final Predicted Phenotype column

This spreadsheet contains the original phenotypes, predicted phenotypes and an indicator variable that states in which fold a sample’s phenotype was predicted. (In that fold, the phenotype was part of the validation set and was thus set to missing.)

GBLUP – ASE – Iteration # (Figure 2-11)

Average ASE values over all folds
Figure 2-11: Average ASE values over all folds

The Allele substitution effect (ASE) column contains the average ASE values from each fold. The absolute values and normalized absolute values are calculated in the same manner as the per fold results.

GBLUP – Fixed Effect Coefficients – Iteration # (Figure 2-12)

Average fixed effect values over all folds
Figure 2-12: Average Fixed Effect values over all folds

GBLUP Summary Statistics – Iteration # (Figure 2-13)

Statistics for each fold and overall
Figure 2-13: Statistics for each fold and the final results spreadsheets

This viewer contains statistics for each fold and the overall results.

For more information about the statistics, see the K-Fold Statistics section of the SVS Manual.

Examining the Bayes C-pi Results

The Bayesian methods produce similar spreadsheets except that the Bayes C Pi estimates by sample – Fold # spreadsheet contains columns for adjusted phenotypes. These are versions of the phenotypes that have been standardized to prevent the error variance from becoming too small (and thus being represented as zero due to the limitations of floating point numbers on computers).

There is also an additional trace spreadsheet which lists the values for each statistic for each iteration. These values are also displayed in a graph (Figure 2-14) based on this spreadsheet.

Trace Plots
Figure 2-14: Trace Plots for pi, effects variance, number of markers included, and error variance

Trace plots can be used for a couple of diagnostics in Markov Chain Monte Carlo techniques such as the Bayes C/C Pi methods. We can look for rapid mixing, which is the distribution reaching a stabilized state where the sampled values hover around the true parameter value. We can also look for an appropriate burn-in period, or at what number of iterations the sampled values seem to converge on the true parameter. (See Figure 2-15.)

Burn-in period
Figure 2-15: Suggestion for an appropriate burn-in period
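In practice, applying a burn-in period simply means discarding the first draws of the trace before summarizing it. A trivial sketch (the 25,000 and 5,000 figures are just example numbers):

```python
def discard_burn_in(chain, burn_in):
    """Drop the first burn_in draws of an MCMC trace before summarizing it."""
    return chain[burn_in:]

# With 25,000 draws and a 5,000-draw burn-in, 20,000 draws remain.
post = discard_burn_in(list(range(25000)), 5000)
```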

Another use of the Bayes C Pi Trace Spreadsheet – Fold # – Iteration # spreadsheet is to create autocorrelation plots.

  • To create these, open the Bayes C Pi Trace Spreadsheet – Fold 1 – Iteration 1 spreadsheet, click on Plot and select Autocorrelation Plots.
  • Add all four columns and check all the options under Additional Options.
Autocorrelation Plot
Figure 2-16: Autocorrelation Plot

Autocorrelation plots (Figure 2-16) show us if one value of a parameter influences the next sampled value. Values closer to +/- 1 indicate a strong correlation. We have also included 95% confidence intervals to help us interpret whether we have significant correlation.
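For illustration, the lag-k autocorrelation of a trace, along with the approximate 95% confidence bound usually drawn on such plots, can be computed as follows (a sketch, not SVS's implementation):

```python
import math

def autocorrelation(chain, lag):
    """Lag-k autocorrelation of an MCMC trace of a single parameter."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def ci_bound(n):
    """Approximate 95% confidence bound for a white-noise trace of length n."""
    return 1.96 / math.sqrt(n)
```

Autocorrelations that stay inside ±ci_bound(n) are consistent with independent draws; values well outside the bound indicate significant correlation between successive samples.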

If there is significant autocorrelation, a possible remedy is to use thinning when running Bayes C-pi. This is where we keep only every x-th iteration to try to reduce the number of iterations that influence the following iterations. However, it has been debated how effective thinning is, and whether it is better to run a larger chain or to simply identify whether or not this model is really appropriate for your dataset.
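Thinning itself is just subsampling the chain at a fixed stride, for example:

```python
def thin(chain, step):
    """Keep every step-th draw so retained samples sit farther apart."""
    return chain[::step]

kept = thin(list(range(1000)), 10)  # keeps draws 0, 10, 20, ..., 990
```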

3. Interpretation

After K-Fold has finished, examination of the results will help us determine which models are a good fit and if we need to adjust the algorithm options to better fit our data.

GBLUP Results

Since we performed multiple iterations, we can find the average of the statistics in the GBLUP Iteration Summary Statistics viewer (Figure 3-1).

GBLUP Iteration Summary Statistics
Figure 3-1: GBLUP Iteration Summary Statistics

For our GBLUP results we had a Pearson’s Product-Moment Correlation Coefficient of 0.2354844483, which doesn’t support a very strong correlation between the predicted and actual phenotypes. Further evidence that the GBLUP model might not be the best fit is that the Residual Sum of Squares and Total Sum of Squares are quite high.
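For reference, the residual sum of squares measures how far the predictions fall from the actual phenotypes, while the total sum of squares measures the phenotypes' overall variation about their mean. A minimal sketch with made-up values:

```python
def rss_tss(actual, predicted):
    """Residual and total sums of squares for a set of predictions."""
    mean = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean) ** 2 for a in actual)
    return rss, tss

# Hypothetical values: residuals small relative to the total variation.
rss, tss = rss_tss([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

A model that predicts well has an RSS much smaller than the TSS; when both are large, as in our GBLUP results, the model is leaving much of the phenotypic variation unexplained.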

For a list of all the quantitative statistics and their formulas, please see the K-Fold Quantitative Statistics section of the SVS Manual.

Bayes C Pi Results

For Bayes C Pi, when we look at the Bayes C Pi Iteration Summary Statistics results viewer (Figure 3-2) we see a similar Pearson’s Product-Moment Correlation Coefficient value. The other statistics also have similar values to those from GBLUP. So, at this point, we don’t have a clear “winner” between the two models.

Bayes C Pi Iteration Summary Statistics
Figure 3-2: Bayes C Pi Iteration Summary Statistics

For the Bayes methods, we have additional graphs to examine to determine whether we should adjust the options when running Bayes C Pi to better fit our data.

The 1000 iterations we completed were probably not enough to determine whether this model is a good fit for our data. Below (and also in the previous section) are trace plots (Figure 3-3) and autocorrelation plots (Figure 3-4) of a 25,000-iteration Bayes C Pi run.

Bayes C Pi Trace Plots
Figure 3-3: Bayes C Pi Trace Plots
Bayes C Pi Autocorrelation Plot for pi
Figure 3-4: Bayes C Pi Autocorrelation Plot for Pi

If we look at the Pi statistic, we can see the value of Pi decreasing, then staying somewhat stable from around iteration 12,000 onward. If we look at the autocorrelation plot, we see a relatively high autocorrelation at the beginning and end of the iterations.

The large autocorrelation values at the end indicate that the Pi values of successive iterations are quite similar. This suggests we don’t really need this many iterations: the Pi value has settled and is changing little, so the extra iterations are largely wasted effort.

We can also use the two plots to estimate a potential burn-in period. At around 5,000 iterations the Pi value moves away from its seemingly high starting value of 0.5. We also see in the autocorrelation plot that the autocorrelation drops significantly and stays within the 95% confidence bounds.

For more information, please see the Autocorrelation Plots section of the SVS Manual.

A suggested parameter adjustment for Bayes C Pi, based on these graphs, is to set the number of iterations to 12,000 and the burn-in to 5,000 to see if this improves the summary statistics.

Updated on March 22, 2021
