Welcome to the SVS Genomic Prediction with K-Fold Tutorial!
Updated: March 8, 2021
Version: 8.7.0 or higher
K-Fold cross validation is a model validation technique that can be used to assess how well a model can predict a phenotype for a given independent dataset. To evaluate our model, we use samples for which we have both genotype and phenotype data.
To prepare for k repetitions of the cross-validation procedure (“K-Fold”), SVS partitions our data into k subsets of samples, which we shall call subsamples. During each repetition, or “fold”, one subsample is selected to be the validation set while the model is fitted to the other k-1 subsamples. (In a sense, this samples k times “without replacement”.) Using the combined results from the k repetitions we can analyze how well the model was able to predict the phenotypes for our dataset and how well it can predict an independent data set.
Subsamples are sampled with either simple random sampling or, if there is a subpopulation grouping present, stratified random sampling.
Several variations are available for this procedure. Along with selecting model parameters and the value of k, it is possible to select how often to repeat the above procedure, starting with partitioning the data into k subsets. Additionally, an option is available that is equivalent to sampling “with replacement” for the validation set.
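The partition-and-validate loop described above can be sketched in a few lines of Python. This is a minimal illustration of the general k-fold idea, not SVS's implementation; the model fitting and prediction steps are left as placeholders.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Partition sample indices into k subsamples by simple random sampling."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k roughly equal subsamples.
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, n_repetitions):
    """Repeat the full k-fold procedure, re-partitioning the data each time."""
    for rep in range(n_repetitions):
        folds = k_fold_indices(n_samples, k, seed=rep)
        for fold, validation in enumerate(folds):
            # Every subsample except the current one is used for training.
            training = [i for sub in folds if sub is not validation for i in sub]
            # Placeholder: fit the model on `training`, then predict
            # phenotypes for `validation` and record the results.
            yield rep, fold, sorted(training), sorted(validation)
```

Each repetition re-partitions the data, and within a repetition every sample appears in exactly one validation set, which is what "sampling without replacement" means here.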
Using a Soybean dataset, this tutorial will lead you step-by-step through the workflow for determining whether a prediction model is appropriate.
This tutorial will not cover the data importing and quality control procedures that were used. To learn more about the various quality control tools available in SNP & Variation Suite (SVS), please refer to the quality control section of the manual.
To complete this tutorial you will need to download the following:
We hope you enjoy the experience and look forward to your feedback.
1. Overview of the Project
- From the main SVS welcome screen, select File > Open Project. Navigate to the Genomic Prediction with K-Fold Tutorial folder you just downloaded and select the Genomic Prediction with K-Fold Tutorial.ghp file. Click Open to finish. You will see the project navigator view (Figure 1-1).
The main data spreadsheet named SimPhens + SoyBeanGWAS – Sheet 1, which has mapped genotypic information and four phenotype columns, will be used to complete the analysis. (Figure 1-2.)
For this tutorial, we will predict the second phenotype, Phen2, and use the fourth phenotype, Phen4, for our sampling technique.
The project already contains the analysis output from the Bayes C-pi portion of the K-Fold run. In the next section we will complete the GBLUP portion of the analysis and discuss the output from both tools.
NOTE: Due to its iterative nature, the Bayes C-pi algorithm can take considerably more time to finish than GBLUP. The total time for both tools to run consecutively in this project can be about 3 hours, depending on your computer's specifications.
2. Performing K-Fold Cross Validation
Running GBLUP Analysis
- To run K-Fold for GBLUP, open SimPhens + SoyBeanGWAS – Sheet 1 and select Genotype > K-Fold Cross Validation (for Genomic Prediction).
- A second spreadsheet will automatically be created, SimPhens + SoyBeanGWAS Numeric Genotypes Additive (Major/Minor used). This spreadsheet contains numerical representations of the genotypes.
- For this tutorial, choose Genomic Best Linear Unbiased Predictors (GBLUP) under Method(s). Check Stratify Folds by and select Phen4. Under K-Fold Options set the Number of Iterations to 2.
NOTE: The options selected for the Bayes C-pi analysis are also seen in Figure 2-1 below. In particular, the Number of Iterations was set to 1000. This smaller number of iterations was selected for expediency. To obtain more accurate results, it is advised to select a larger number of iterations.
- Click OK to begin the analysis.
The K-Fold method will first create our subsamples using the stratified random sampling method. A subsample will be created for each of the k folds in such a way that each stratum is proportionally represented.
Consider the following dataset (Figure 2-2) that is a portion of the soybean data from the tutorial project.
We have 60% of our samples in group 0 and 40% in group 1 based on Phen4, so for each subsample, we should expect about the same proportions of samples in each group. If we look at the red subsample set, we have a total of four samples with 50% in group 0 and 50% in group 1.
In our example, we selected five folds, so we created five subsamples represented by the color groups (red, yellow, green, blue and black). For each fold, we select one of the subsamples (color groups) to be the validation set and the rest of the samples to be the training set. We continue until each subsample has been the validation set for a fold.
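The stratified assignment described above can be sketched as follows. This is an illustrative round-robin scheme, not SVS's exact algorithm: each stratum is shuffled and dealt across the k subsamples so that every subsample keeps roughly the stratum proportions of the full dataset.

```python
import random

def stratified_folds(groups, k, seed=0):
    """Assign each sample to one of k subsamples, stratified by group label.

    `groups` maps sample index -> stratum label (e.g. a Phen4 value of 0 or 1).
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    strata = {}
    for idx, label in groups.items():
        strata.setdefault(label, []).append(idx)
    for members in strata.values():
        rng.shuffle(members)
        # Deal each stratum round-robin so its members spread evenly over folds.
        for i, idx in enumerate(members):
            folds[i % k].append(idx)
    return folds

# Toy data: 20 samples, 60% in group 0 and 40% in group 1, dealt into k=4 subsamples.
groups = {i: (0 if i < 12 else 1) for i in range(20)}
folds = stratified_folds(groups, k=4)
# Each subsample ends up with 5 samples: 3 from group 0 and 2 from group 1,
# matching the 60/40 proportions of the full dataset.
```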
For more information about the K-Fold algorithm, see the K-Fold Cross Validation section of the SVS Manual.
Information on the GBLUP and Bayesian methods can be found in the Methods for Mixed Linear Model Analysis section of the SVS Manual.
Examining the GBLUP Results
When K-Fold Cross Validation finishes there should be results for GBLUP and Bayes C Pi for two iterations and five folds in the project. At the end of each iteration there will be several Final Results spreadsheets and a Final Results viewer, and at the end of all the iterations there will be an Iteration Summary Statistics spreadsheet and an Iteration Summary Statistics viewer.
For each GBLUP fold there will be four to five spreadsheets.
GBLUP – Fold # – Iteration # (Figure 2-3)
GBLUP Genomic Relationship Matrix (Figure 2-4)
This will only be created for the first fold and only if a precomputed matrix was not selected in the options dialog.
GBLUP estimates by marker – Fold # (Figure 2-5)
Genomic value plots from the Normalized Abs ASE column can be used to visualize the contribution of each locus to the model.
- Right-click on the Normalized Abs ASE column and select Plot Variable in GenomeBrowse. (See Figure 2-6.) You can do this for the output from any fold or from the final results of the analysis.
GBLUP fixed effect coefficients – Fold # (Figure 2-7)
This spreadsheet lists the intercept and any fixed effects included in the model. For this example, we didn't add fixed effects in the options dialog, so we only see the intercept. The Reference Covariate? value is 0 because the intercept is not categorical. (This column is used to distinguish categorical covariates that are reference covariates.)
GBLUP estimates by sample – Fold # (Figure 2-8)
This spreadsheet contains three columns: actual phenotypes, predicted phenotypes, and random effect components (gEBV). We can visually compare the actual and predicted phenotypes by creating a scatter plot.
- Click on the Actual Phenotype column to make it a dependent variable, then right-click on the Predicted phenotype column and select Plot X: Predicted phenotype vs Y: Dependent(s). This will create a plot under this spreadsheet called Plots from GBLUP estimates by sample – Fold # – Iteration 1 against Predicted phenotype. (Figure 2-9)
This graph shows evidence of a positive correlation between the actual and predicted phenotypes. Below, we’ll look at the GBLUP Summary Statistics where we can look at the per fold and overall statistics to draw more conclusions on the strength of this model.
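The correlation visible in the scatter plot can be quantified with Pearson's product-moment correlation coefficient, the same statistic SVS reports in its summary viewers. A minimal sketch, using made-up phenotype values rather than the tutorial dataset:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative values only, not taken from the tutorial project.
actual    = [1.2, 0.4, 2.1, 1.8, 0.9]
predicted = [1.0, 0.7, 1.9, 1.5, 1.1]
r = pearson_r(actual, predicted)  # a positive r indicates positive correlation
```

Values of r near +1 indicate the model's predictions track the actual phenotypes closely; values near 0 indicate little predictive value.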
At the end of each iteration, GBLUP will have four outputs: three spreadsheets and one results viewer.
GBLUP Final Results – Iteration # (Figure 2-10)
This spreadsheet contains the original phenotypes, predicted phenotypes and an indicator variable that states in which fold a sample’s phenotype was predicted. (In that fold, the phenotype was part of the validation set and was thus set to missing.)
GBLUP – ASE – Iteration # (Figure 2-11)
The Allele substitution effect (ASE) column contains the average ASE values from each fold. The absolute values and normalized absolute values are calculated in the same manner as the per fold results.
GBLUP – Fixed Effect Coefficients – Iteration # (Figure 2-12)
GBLUP Summary Statistics – Iteration # (Figure 2-13)
This viewer contains statistics for each fold and the overall results.
For more information about the statistics, see the K-Fold Statistics section of the SVS Manual.
Examining the Bayes C-pi Results
The Bayesian methods produce similar spreadsheets except that the Bayes C Pi estimates by sample – Fold # spreadsheet contains columns for adjusted phenotypes. These are versions of the phenotypes that have been standardized to prevent the error variance from becoming too small (and thus being represented as zero due to the limitations of floating point numbers on computers).
There is also an additional trace spreadsheet, which lists the value of each statistic at each Bayesian iteration. These values are also displayed in a graph (Figure 2-14) based on this spreadsheet.
Trace plots can be used for a couple of diagnostics in Markov Chain Monte Carlo techniques such as the Bayes C/C Pi methods. We can look for rapid mixing, which is the distribution reaching a stabilized state where the sampled values hover around the true parameter value. We can also look for an appropriate burn-in period, or at what number of iterations the sampled values seem to converge on the true parameter. (See Figure 2-15.)
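The burn-in idea can be illustrated with a couple of lines of Python. This is a hedged sketch: `trace` is a toy stand-in for the sampled values in a trace spreadsheet, and the burn-in length is chosen by eye from the trace plot, just as described above.

```python
def posterior_mean(trace, burn_in):
    """Discard the first `burn_in` samples, then average the stabilized remainder."""
    kept = trace[burn_in:]
    return sum(kept) / len(kept)

# Toy chain: starts high before converging, then hovers around 0.2.
trace = [0.5, 0.45, 0.4, 0.3] + [0.2, 0.21, 0.19, 0.2, 0.22, 0.18]
estimate = posterior_mean(trace, burn_in=4)  # averages only the post-burn-in samples
```

Averaging over the whole chain would bias the estimate toward the unconverged starting values; discarding the burn-in keeps only samples drawn after the chain has stabilized.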
Another use of the Bayes C Pi Trace Spreadsheet – Fold # – Iteration # spreadsheet is to create autocorrelation plots.
- To create these, open the Bayes C Pi Trace Spreadsheet – Fold 1 – Iteration 1 spreadsheet, click on Plot and select Autocorrelation Plots.
- Add all four columns and check all the options under Additional Options.
Autocorrelation plots (Figure 2-16) show us if one value of a parameter influences the next sampled value. Values closer to +/- 1 indicate a strong correlation. We have also included 95% confidence intervals to help us interpret whether we have significant correlation.
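The lag-h autocorrelation and its approximate 95% confidence band can be computed as follows. This is a simple sketch of the standard sample-autocorrelation formula, not SVS's plotting code; the white-noise band of ±1.96/√n is the usual rule of thumb for judging significance.

```python
import math

def autocorrelation(chain, lag):
    """Sample autocorrelation of an MCMC chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return cov / var

def ci_bound(n):
    """Approximate 95% confidence bound for white noise: +/- 1.96 / sqrt(n)."""
    return 1.96 / math.sqrt(n)

# Toy chain that alternates high/low, producing strong negative lag-1 correlation.
chain = [0.2, 0.8, 0.3, 0.7, 0.25, 0.75, 0.3, 0.8, 0.2, 0.7]
r1 = autocorrelation(chain, lag=1)
significant = abs(r1) > ci_bound(len(chain))  # outside the 95% band?
```

Lags whose autocorrelation falls outside the confidence band suggest that successive samples are not behaving independently.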
If there is significant autocorrelation, a possible remedy is to use thinning when running Bayes C-pi. This is where we keep only every x-th iteration to try to reduce the number of iterations that influence the following iterations. However, it has been debated how effective thinning is, and whether it is better to run a larger chain or to simply identify whether or not this model is really appropriate for your dataset.
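Thinning itself is trivial to express: keep every x-th sample and discard the rest. A one-line sketch, with a toy chain standing in for the sampled Pi values:

```python
def thin(chain, every):
    """Keep every `every`-th sample to reduce autocorrelation between kept draws."""
    return chain[::every]

chain = list(range(100))          # stand-in for 100 sampled values
thinned = thin(chain, every=10)   # keeps samples from iterations 0, 10, 20, ...
```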
3. Interpreting the Results
After K-Fold has finished, examination of the results will help us determine which models are a good fit and if we need to adjust the algorithm options to better fit our data.
GBLUP Results
Since we performed multiple iterations, we can find the average of the statistics in the GBLUP Iteration Summary Statistics viewer (Figure 3-1).
For our GBLUP results we had a Pearson's Product-Moment Correlation Coefficient of 0.2354844483, which does not indicate a very strong correlation between the predicted and actual phenotypes. Further evidence that the GBLUP model might not be the best fit is that the Residual Sum of Squares and Total Sum of Squares are quite high.
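The two sums of squares mentioned above can be computed directly from the actual and predicted phenotypes. A sketch with illustrative numbers only; in SVS these statistics are reported in the summary viewer:

```python
def residual_sum_of_squares(actual, predicted):
    """RSS: sum of squared differences between actual and predicted phenotypes."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def total_sum_of_squares(actual):
    """TSS: sum of squared deviations of the actual phenotypes from their mean."""
    mean = sum(actual) / len(actual)
    return sum((a - mean) ** 2 for a in actual)

# Illustrative values only.
actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
rss = residual_sum_of_squares(actual, predicted)  # 1.0
tss = total_sum_of_squares(actual)                # 5.0
```

An RSS close to the TSS means the model explains little of the phenotypic variation; an RSS much smaller than the TSS indicates a better fit.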
For a list of all the quantitative statistics and their formulas, please see the K-Fold Quantitative Statistics section of the SVS Manual.
Bayes C Pi Results
For Bayes C Pi, when we look at the Bayes C Pi Iteration Summary Statistics results viewer (Figure 3-2) we see a similar Pearson’s Product-Moment Correlation Coefficient value. The other statistics also have similar values to those from GBLUP. So, at this point, we don’t have a clear “winner” between the two models.
For the Bayes methods, we have additional graphs to examine to determine whether we should adjust the options when running Bayes C Pi to better fit our data.
The 1000 iterations we completed were probably not enough to determine whether this model is a good fit for our data. Below (and also in the previous section) are trace plots (Figure 3-3) and autocorrelation plots (Figure 3-4) of a 25,000-iteration Bayes C Pi run.
If we look at the Pi statistic we can see the value of Pi decreasing, then staying somewhat stable from around iteration 12,000 and afterwards. If we look at the autocorrelation plot, we see a relatively high autocorrelation at the beginning and end of the iterations.
The large autocorrelation values at the end suggest that consecutive Pi values are fairly similar: the Pi value has settled and is not changing much, so running this many iterations is largely wasted effort.
We can also use the two plots to inform us about a potential burn-in period. We see that at around 5000 iterations the Pi value gets away from the seemingly high Pi value of 0.5. We also see in the autocorrelation plot that the autocorrelation value goes down significantly and stays within the 95% confidence bounds.
For more information, please see the Autocorrelation Plots section of the SVS Manual.
A suggested parameter adjustment for Bayes C Pi, based on these graphs, is to set the number of iterations to 12,000 and the burn-in to 5,000 to see if this improves the summary statistics.