Hello and welcome to Golden Helix's final webcast for 2019. My name is Delaina Hawkins and I am the Director of Marketing and Business Development, and it is my pleasure to introduce today's presenter, my colleague Julia Love, who is one of Golden Helix's Field Application Scientists. Julia, thank you so much. It's wonderful to have you with us today.
Hi Delaina. Thanks for having me. And thank you all for joining the webcast.
Before we begin, I would just like to remind our audience of our Q&A process. So one of the benefits of joining these live events is the opportunity to ask our presenters questions. So if you have any questions today, please enter those onto the questions panel, the photo on your screen here shows where you'll be able to find this. And then after Julia's finished with her presentation, we'll be answering all your questions. So that's it for now. I will be back later to cover some exciting events happening here at Golden Helix. But for now, I will hand things over to you, Julia, and let you get started.
So today I'll be going through a GWAS workflow, including various quality control metrics. I'm also going to follow up that association test with a genomic prediction workflow, where we're going to compare two methods used for genomic prediction, all within SVS. But before I jump into the demonstration, I first and foremost want to mention that we recently received grant funding from the NIH, which we are incredibly grateful for. And the research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under these listed awards here. Additionally, we're also grateful for receiving local grant funding from the state of Montana. Our P.I. is Dr. Andreas Scherer, who is the CEO here at Golden Helix. And of course, the content described today is the responsibility of the authors and does not officially represent the views of the NIH. But now let's go ahead and learn more about Golden Helix as a company.
So Golden Helix is a global bioinformatics software company that enables research and clinical practices to analyze large genomic datasets. We were originally founded in 1998 based on the pharmacogenomics work performed at GlaxoSmithKline, who was and still is a primary investor in our company.
We currently have two flagship products VarSeq and SNP and Variation Suite, or SVS, which is gonna be the focus of the webcast today. But SVS is our research platform and it enables researchers to perform complex analyses and visualizations on both genomic and phenotype data. And SVS has a range of tools to perform a wide variety of analyses, such as GWAS, genomic prediction, and CNV analysis as well, just to name a few. But we also do have our clinical analysis tool VarSeq, which is tailored for filtering and annotating variants. Additionally, users have access to automated AMP or ACMG variant guidelines as well as the capability to detect copy number variations. Additionally, the finalization of your variant interpretation and classification is further optimized with the VSClinical reporting capability. Users can integrate all of these features into a standardized workflow which can be automated even more with VSPipeline. Paired with VarSeq is VSWarehouse, which serves as the central repository for all of the large amounts of genomic data that you're collecting. But it also does hold your reports and assessment catalogs as well. In VSWarehouse, all of the data is fully queryable and can be shared and accessed easily among users.
Our software has been cited in 1,000s of peer-reviewed publications such as Science and Nature and has been very well received by the industry. In fact, we work with over 400 organizations all over the globe, top tier institutions like Stanford and Yale, government organizations like NCI, but also in several clinics and genetic testing labs as well. And we now have well over 20,000 installs of our products and with 1,000s of unique users. So why is this relevant to you? This means that over the course of 20 years, our products have received a lot of user feedback, which we immediately try to incorporate into developing and releasing newer versions of our products. We receive active research grants to support the advancement of our software capability, which is always directed from our user feedback and awareness of the industry needs. We also stay relevant in our community by regularly attending conferences and providing useful product information via our online eBooks, tutorials, and blog posts. So your access to the software is a simple subscription based model where we don't charge per sample or per-version. And with that subscription, you also get full access to our support and training staff so you can get up to speed quickly with your analysis.
The Golden Helix full-stack provides the capability to start with a FASTQ file and get all the way down to your clinical report. This is achievable through a partnership with Sentieon who provides the alignment and variant calling steps to produce VCF and BAM files. And these outputs serve as the basis for importing data into VarSeq for all your tertiary analysis. If you happen to be performing NGS based CNV analysis, Golden Helix is the market leader. Additionally, the imported variants into your VarSeq project can be run through VSClinical automated ACMG and AMP guidelines. After you complete all of your secondary and tertiary processing, all analyses can be rendered into a clinical report which you can store in VSWarehouse, which provides researchers and clinicians with access to that information and to review any previous findings.
So we talked a little bit about the various software that we have. But like I said, today we're going to focus on SVS. So a little more background on SVS is that it was created specifically for genetic research and was initially developed in the GWAS era. So our goal when creating SVS was to have high performance but easy to use software with rich visualization tools that would make SVS an obvious and popular choice for genetic researchers. Since then, as the world of genetic research has evolved, we have added numerous tools to account for a variety of workflows. SVS now includes GWAS and genomic prediction, rare variant burden analysis, collapsing methods, traditional NGS CNV analysis, imputation, and DNA sequencing workflows. There are also some limited RNA-seq capabilities as well. But all of these methodologies available in SVS are accompanied by an incredible and intuitive visualization powered by GenomeBrowse. So most of the GenomeBrowse features we will be looking at in SVS are going to be involved in visualizing genomic plots. For today, since we're going to be looking at GWAS, GenomeBrowse will produce publication-ready Manhattan plots, but also linkage disequilibrium plots as well. So an approximate agenda for today is we're going to first start with a GWAS workflow, and we're gonna go over some options that are available for running the association test as well as some quality control measures. And we're going to follow that up with genomic prediction with k-fold. And so we're gonna go over a background as well as options for running genomic predictions with our models in SVS. And then we're gonna go over the results of both our GWAS and genomic prediction in an example project. And then we will definitely leave some time at the end for questions.
So the data that I'm going to be using today is from the Bovine HapMap project with the Illumina 50k genotypes with simulated phenotypes. And the dataset consists of 472 Bos taurus samples that we will first use in our GWAS workflow. And we're gonna go through any necessary quality control steps to run our association test and produce those graphics that are really great for your publications. An additional application for our dataset is going to be to carry out genomic prediction on a phenotype of interest. And so we'll want to see how well our existing samples with their genetic and phenotypic information will predict the genotype required for our phenotype of interest.
So why GWAS? GWAS has been around for a while now, coming up on 20 years, and GWAS was initially designed to study the human genome and has really since then been the go-to technology for gene finding research. However, as GWAS evolved and was used more and more, non-human applications have emerged, particularly allowing researchers to identify and map SNPs that underlie desirable traits in agrigenomics. And with the availability of high throughput genotyping arrays for a variety of animals and crop species, scientists can now improve breeding programs or food production through genetics. And so with that, I'd like to actually take a quick poll and ask all of you what species you guys are analyzing in your datasets. Is that going to be plants, animals, human or others? And I'll give you a few seconds here to answer that.
All right, so it looks like most of you are working with animals or humans, a couple of plant people out there, but ultimately, no matter what species you are working with, GWAS can be considered the first step towards understanding the architecture of traits and ideally GWAS will result in SNPs that are associated with your trait of interest. And right here in this Manhattan plot here, it shows this concept quite nicely, as you can see, all the SNPs across all the chromosomes and that perhaps there's a nice group of SNPs here that you can then evaluate for functional and phenotype significance. But now what I want to take a look at is the GWAS workflow in a bit more detail before we hop into the software.
So once we have our data imported into SVS, we want to make sure that we have adequately prepared our data because we don't want anything influencing the result of our association test. So let's go ahead and review some of the quality control measures that are available in SVS that we should perhaps implement in our study. So we can ask ourselves a few questions to ensure that we have high quality samples and markers prior to running our association test. The first question we might ask is do we have any outliers in our data? This is both on the marker and sample level. And to answer this question, SVS provides a convenient option to calculate marker and sample statistics and provides advanced plotting tools to help us answer this question. A second question we might want to ask is, is there any relatedness among our samples? And in SVS there are a few ways to calculate a genomic relationship matrix to help us answer that question. And so with that, I am about to hop into the software. But remember, if you do have any questions, please enter them into the chat panel as we go.
OK, so for those of you who may not have been exposed to SVS before, I thought I'd just give a simple breakdown of the layout of what you would see with any SVS project. So here on the left is our Project Navigator window. And what you're going to see here is the collection of all your various spreadsheets and outputs of your plots and algorithms, and they're all going to be generated right here. And accompanying the Project Navigator window is the Node Change Log. And so any changes that you make to elements in the project here in the navigator window, they're going to be reported here and it's going to log what that change was, who made the change and when. And then the last component of the SVS interface is down here in the bottom right hand corner. And that's the user notes. And it serves as sort of an electronic notebook for any of those additional notes that you want to add. And those notes will also be tracked here in the Node Change Log. So now you have an idea of sort of how to navigate the interface here. Let's talk about data import briefly before we jump into our analysis.
So if we go ahead and take a peek at the import menu, you'll see several different functions for importing different kinds of data. And some of the importers are specific to certain platforms such as the Affymetrix or the Illumina sub menus. And we find that once users do get their data imported into the software, they are entirely self-sufficient. And the FAS team here at Golden Helix is here to help you with this import process. And if you have questions about which type of import menu to use, please feel free to ask us and we can guide you through this process. But to go over that, here are some individual nodes that were generated in my project from import, like our genotype spreadsheet and our phenotype spreadsheet and also some tests that I have run in advance. So if I go ahead and open up our genotype spreadsheet, we can take a look at our data.
So we go ahead and open up our genotype spreadsheet and we can see a list of all of our samples that were imported into our project here, one per row. And then our columns are the list of all of our SNPs that were imported into the project as well. And up here, we can get a summary of how many samples were added to the project as well as how many markers. Also, from this spreadsheet, we can see our marker map that can be accessed through this green map icon in the top left hand corner here. And it's going to contain all of our mapping information for each SNP. And so as I mentioned earlier, we want to make sure that our samples and markers for the study are of high quality and that we are accounting for any underlying structure in our data. And so to illustrate this point, I did run a naive association test before I did any quality control methods. So let's go ahead and take a look at that. And a great way to quickly assess if there is any inflation in our data is by looking at our QQ plots. And so I'll go and click on that.
OK, there we go. We can see here that we do have a little inflation in our data when we compare our expected P values to our observed P values. And the first question that we want to ask is why is this happening? But since this association test was performed on our naive dataset, the explanation is pretty simple. We need to go perform some quality control procedures on our samples and markers. And so with that, let's begin with the first step and examine our genotypic data that we imported into the project. When we're working with a dataset for the first time, it's helpful to look at our sample statistics to determine any information about our samples, in particular, if we have any outliers. And so within our genotype spreadsheet, we can go to the genotype menu here and take a look at both our marker and sample statistics.
So with our marker statistics, we can calculate things like call rate, allele frequencies, and Hardy-Weinberg equilibrium P values, among other statistics as well. But what we want to do is calculate our sample statistics to ensure that our samples have quality call rates and heterozygosity, which are always outputs when you calculate sample statistics, among other optional outputs as well.
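To make the marker statistics just mentioned a bit more concrete, here is a minimal Python sketch of how a call rate, a minor allele frequency, and a Hardy-Weinberg P value could be computed for a single SNP. It is an illustration only, not how SVS computes them: the function name `marker_stats`, the 0/1/2 genotype coding with `None` for a missing call, and the use of a 1-degree-of-freedom chi-square test for Hardy-Weinberg are all assumptions made for this example.

```python
import math

def marker_stats(genotypes):
    """Call rate, minor allele frequency, and HWE chi-square P value for one SNP.

    genotypes: list of 0, 1, or 2 (copies of the B allele), or None for a
    missing call. This coding is an assumption made for the sketch.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return call_rate, float("nan"), float("nan")

    n_aa, n_ab, n_bb = called.count(0), called.count(1), called.count(2)
    p_b = (n_ab + 2 * n_bb) / (2 * n)      # frequency of the B allele
    maf = min(p_b, 1 - p_b)                # minor allele frequency

    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = [n * (1 - p_b) ** 2, 2 * n * p_b * (1 - p_b), n * p_b ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return call_rate, maf, hwe_p
```

A marker whose genotype counts match the Hardy-Weinberg proportions exactly gives a chi-square statistic of 0 and therefore a P value of 1; small P values flag markers that depart from those proportions.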
But let's go ahead and take a look at our sample statistics now. What we want to evaluate here is our call rates, and this will help us identify any samples that perhaps have poor DNA quality. And so we can sort this column by ascending if we want to, to get our lowest call rates listed here at the top. And another pretty nifty feature that we can do to sort of visualize this a little bit better to easily analyze our call rates across all of our samples is we can create a histogram. So if we go ahead and do that, we'll plot our histogram of our call rates. And we'll also go ahead and adjust these bin sizes to sort of give us a better representation of our call rates across our samples. And for our data set here it does look like our call rates look pretty good. It looks like our lowest one is around .98. But as a sort of general rule of thumb, low call rate thresholds are typically set at around .95.
But let's say we did have some lower call rates just so I can show you guys this. We can set our threshold here by right clicking on our call rate and going to activate by threshold. And for the sake of showing you guys what this looks like, I'm going to go ahead and put it .99 and we'll click OK. And we can see that those samples have been grayed out as these would be ones that had a low call rate that we would want to eliminate from our analysis. And so you'd want to subset this. But since our call rates look pretty good for this data set, I'm going to go ahead and skip this step and go ahead and move on to our next quality control metric. So the next step that we want to do is evaluate our cryptic relatedness among our samples. And identity by descent is a really great way to do this.
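The call-rate check and threshold step described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the function names, the dictionary layout (sample name mapped to a list of calls, with `None` for a missing call), and the toy data are hypothetical, and the .95 default simply mirrors the rule of thumb mentioned in the talk.

```python
def sample_call_rates(genotype_rows):
    """Fraction of non-missing calls per sample.

    genotype_rows: dict mapping sample name -> list of calls, None = missing.
    """
    return {
        sample: sum(call is not None for call in calls) / len(calls)
        for sample, calls in genotype_rows.items()
    }

def split_by_threshold(call_rates, threshold=0.95):
    """Return (kept, dropped) sample names for a minimum call-rate threshold."""
    kept = sorted(s for s, rate in call_rates.items() if rate >= threshold)
    dropped = sorted(s for s, rate in call_rates.items() if rate < threshold)
    return kept, dropped

# Toy data: three samples typed at four markers.
rows = {
    "sample_A": [0, 1, 2, 1],
    "sample_B": [0, None, 2, 1],
    "sample_C": [None, None, 1, 0],
}
rates = sample_call_rates(rows)
kept, dropped = split_by_threshold(rates, threshold=0.7)
```

Samples in the dropped list correspond to the rows that would be grayed out and excluded when subsetting.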
So if I go back to our genotype menu, we can find an analysis for our identity by descent by going to quality assurance and utilities, and we can see here there's that IBD option. But it is standard practice before running your identity by descent estimation to go ahead and do some LD pruning as well. This will identify the SNPs that are in linkage disequilibrium with one another, and so we won't have to run our IBD on those. And so that will speed up the process. But generally LD pruning and identity by descent can take some time. So I did perform this in advance. So let's go ahead and take a look at those results now.
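As a rough picture of what LD pruning does, here is a small Python sketch that drops a SNP when its squared genotype correlation with a recently kept SNP exceeds a threshold. It is a simplified stand-in, not the SVS procedure: the function names, the greedy keep-or-drop rule, the window of 50 previously kept SNPs, and the 0.5 threshold are all assumptions chosen for illustration, and the input is assumed to be 0/1/2 genotype codes with no missing values.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two lists of 0/1/2 genotype codes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return sxy * sxy / (sxx * syy)

def ld_prune(genotypes_by_snp, window=50, threshold=0.5):
    """Greedy pruning: keep a SNP only if it is not in high LD with kept neighbors.

    genotypes_by_snp: dict of SNP name -> genotype list, in map order.
    """
    kept = []
    for snp, calls in genotypes_by_snp.items():
        neighbors = kept[-window:]
        if all(r_squared(calls, genotypes_by_snp[k]) <= threshold for k in neighbors):
            kept.append(snp)
    return kept

snps = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],   # perfectly correlated with snp1
    "snp3": [0, 2, 1, 1, 0, 2],   # only weakly correlated with snp1
}
pruned = ld_prune(snps)
```

The pruned list is what a downstream step such as identity by descent estimation would then run on.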
So, again, I want to select our LD pruned genotype sheet where we can see our identity by descent outputs here. And so what I want to draw attention to, though, is this Estimated PI spreadsheet, which is going to be our genomic relationship matrix. And so relatedness is often defined as family relatedness. But identity by descent estimation is also really great for detecting duplicate samples and sample contamination. And so from this spreadsheet, I have produced a heat map that makes visualizing this data a little bit nicer. So let's go ahead and open that up.
There you go. And so what we see here is a list of our samples each compared to one another. And you can see that by this green line here. But what SVS is actually looking at here is the allele frequency of each SNP to determine the expected rate of sharing, and then comparing this to what is observed from sample to sample. And so these probability values are listed here across the top, ranging from 0 to 1, where 0 is unrelated and represented in this graph as white. And we go all the way up to 1, which is related samples, and they are represented in green. And so this line represents each sample compared to itself, so it makes sense that that's a nice dark green line there. But depending on what your analysis is, you might be looking at siblings. So then you would expect these values to be around .5. If you're working with unrelated samples, you would generally expect these values to be around .2 and lower. But after we go ahead and evaluate the relatedness among our samples and possibly exclude any samples that we want to, we can go on to our next quality control step, which is to identify any samples that may depart from the expected homogeneous populations within our study. And so we can do this by performing principal component analysis. And what it will show us is essentially our underlying population structure. And so, again, we'll continue to work with our filtered dataset to run our principal component analysis. And so we'll go to our genotype menu again, and we'll scroll down to our principal component analysis. And so here we can see a variety of options that we have to control our principal component analysis. You can pick how many principal components you want to analyze as well as select your genetic model. In this case, I have the additive model selected and you want to make sure that this model does also match the model that you run for your association test.
And then also we can determine which spreadsheets that we want to output. And similarly, we go ahead and click run. And the results of your principal component analysis are two spreadsheets, one spreadsheet with a list of all your samples and the principal components listed as well. But you do also get this concise list of principal components and their corresponding eigenvalues.
And so if we look here, what you're looking for with this spreadsheet is where these eigenvalues start to become similar to each other. And so we can see that we have our biggest difference between our top two here, but maybe also this third one. So we might be looking at correcting our data for the first two or maybe three principal components. And so from this spreadsheet with our principal components, I've also merged that with our phenotypic data so I can get a plot of our principal components by breed. So let's go ahead and open that up and take a look. I've compared our top two principal components on our X and Y axis. And then I have also colored our plot by breed. And so here we can see breeds that perhaps are not so related to each other, but then also those that may be related. And with this plot, with our distribution here, it does look like we may have three principal components explaining our population stratification because we sort of have three main clusters that we're looking at. But so from here, we can now run our filtered genotypes and then also correct for our principal components in our association test. So let's go ahead and change gears and do that now.
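For readers who want to see the eigenvalue idea outside the software, here is a minimal NumPy sketch of principal component analysis on additively coded genotypes. It is a generic illustration and not the SVS implementation: the function name `genotype_pca`, the simple column centering (SVS may normalize markers differently), and the simulated two-population data are all assumptions for this example, and missing genotypes are assumed to have been handled beforehand.

```python
import numpy as np

def genotype_pca(genotypes, n_components=3):
    """PCA of a samples x markers matrix of additive codes (0, 1, 2).

    Returns the leading eigenvalues and the per-sample principal component
    scores. Markers are only centered here, which is a simplifying assumption.
    """
    G = np.asarray(genotypes, dtype=float)
    X = G - G.mean(axis=0)                      # center each marker column
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)     # variance along each component
    scores = U[:, :n_components] * s[:n_components]
    return eigenvalues[:n_components], scores

# Simulated example: two groups with very different allele frequencies.
rng = np.random.default_rng(0)
group_a = rng.binomial(2, 0.1, size=(10, 200))
group_b = rng.binomial(2, 0.9, size=(10, 200))
eigenvalues, scores = genotype_pca(np.vstack([group_a, group_b]))
```

In this simulated case the first eigenvalue is far larger than the rest and the first component separates the two groups, which is the same pattern one looks for when deciding how many components to correct for.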
And so I have merged our genotype and our phenotype spreadsheet here for our association test. And so we can go ahead and open that up. The first step that we're going to do is select the variable that we would like to run for our association test, and then we'll go back to our genotype menu and come down to our genotype association test. And so here we see all of our options available for running our association tests. And so we have that additive model that we want to make sure that we select based on our principal component analysis. We also want to make sure that we get the output of that QQ plot so we can compare that to our naive association test. And additionally, we want to make sure that this box is checked to correct for our population stratification. And to do that, we also want to make sure that we select that pre-computed principal components sheet, select that here, and then we can go ahead and click run. And what we're really going to want to look at is that QQ plot to see if we improved on the inflation in our data. And so I've got this spreadsheet over here. Just pull that over. OK. And so to take a look at that, we will want to go ahead and make that QQ plot with our expected P values on our X axis, and then on our Y axis we'll use those observed P values. And let's take a look and see if our plot got any better.
OK. And then we'll go ahead and also add a slope line. And so, yeah, that looks much better. We may still have some inflation, but otherwise this all looks pretty good and then we can also see our significant SNPs up here. Another great way to visualize this data is we can go back to our P values here and then we can go ahead and make a Manhattan plot of our P values so we can see our significant SNPs across our genome very, very nicely. And then we have a great graph that's ready for publication. We can color that by chromosome. And there you go. And there's again, those significant SNPs.
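The QQ comparison of expected and observed P values can also be sketched in plain Python. This is an illustration under stated assumptions rather than what SVS does internally: the function names are hypothetical, the expected value for the i-th smallest of n P values is taken as i / (n + 1), and the genomic inflation factor is an addition that was not shown in the talk; it is a commonly used one-number summary of the inflation a QQ plot shows.

```python
import math
from statistics import NormalDist, median

def qq_points(p_values):
    """(expected, observed) pairs on the -log10 scale for a QQ plot."""
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = rank / (n + 1)   # expected rank-th smallest uniform P value
        points.append((-math.log10(expected), -math.log10(p)))
    return points

def inflation_factor(p_values):
    """Genomic inflation factor: median chi-square statistic over its null median."""
    normal = NormalDist()
    # Convert each two-sided P value back to a 1-df chi-square statistic.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / 0.4549364   # 0.4549364 = median of chi-square, 1 df

# P values that follow the null expectation exactly.
null_p = [i / 1001 for i in range(1, 1001)]
points = qq_points(null_p)
lam = inflation_factor(null_p)
```

Points that sit on the diagonal and an inflation factor near 1 suggest little inflation, while values noticeably above 1 correspond to the upward drift seen in the naive association test.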
And so performing GWAS is really a great way to start analyzing your data. It really makes sure that any future statistical tests that you run will be on quality samples and markers. And so as I mentioned earlier, we are going to follow up this GWAS with a genomic prediction. And so to set the stage for what we need to know and what our goals are in performing genomic prediction, I'm gonna go ahead and switch gears and go back to our slides.
All right. So why do we want to use genomic prediction? Genomic prediction is becoming more and more important as a focus for plant and animal research and agrigenomics in general. And this can be explained by our exponential growth in the worldwide population due to advances in both science and medicine. And so according to this chart, our population is estimated to increase to 8.5 billion by 2030, which is about a 16% increase. Naturally, our increasing population requires us to improve our food production techniques, to increase yields, but also to find environmentally sound and sustainable methods to support our future population. And one method that is being used to address this concern is genomic prediction. And so as a use case for genomic prediction, we might be calculating breeding values to say, select for larger cows, for example. Then historically we might rely on pedigree or observational data to determine which cow is optimal to breed. Perhaps if we consider the sizes of these two cows here, we might pick this one. However, if we look at their genetics, we might discover that this cow happened to be large relatively by chance, and we might find that this cow, the seemingly smaller cow, genetically has a better combo of genes to produce larger cows.
Another consideration for the use of genomic prediction is to predict the phenotype for breeding values that are either costly or difficult to measure biologically. Some examples of this might include milk production in bulls or perhaps beef quality metrics that are difficult to ascertain while the animal is still alive. And so genomic prediction is a solution to assess this genetic potential of the samples in your dataset and discover which SNPs have the best predictive power for a given trait.
And so now that we have sort of an understanding of what genomic prediction can be used for, let's talk about the available methods in SVS. And so SVS currently has three genomic prediction methods available: gBLUP or Genomic Best Linear Unbiased Predictor, Bayes C and Bayes C-Pi. The plot here on the right illustrates that gBLUP and Bayes C-Pi do produce similar results. However, there are some tradeoffs between the two methods and it's important to be able to compare them and choose the best method for a particular trait. So let's go ahead and look at some of the differences and similarities now between gBLUP and Bayes C-Pi. So one of the main differences is that gBLUP includes all SNPs or markers in the model to predict the phenotype. In Bayes C-Pi, only a select number of SNPs are selected to produce an ideal model. And another difference is that gBLUP computes results a bit faster than Bayes C-Pi because generally Bayes C-Pi is more computationally demanding. However, it does lead to more precise estimates of the marker effects. Part of this is due to Bayes C-Pi including the calculation of that Pi parameter, which is the probability that any SNP will have no effect on the phenotype. Additionally, in Bayes C-Pi the variance is calculated at each locus, but in gBLUP it is a constant value. Both methods do produce similar output, such as allele substitution effect or ASE, breeding values, and predicted phenotype. And they both do incorporate a genomic relationship matrix and we're going to use some of these outputs later to compare these models.
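To give a feel for how gBLUP ties together the genomic relationship matrix, breeding values, and allele substitution effects mentioned above, here is a minimal NumPy sketch. It is a textbook-style simplification and not the SVS implementation: the function name `gblup_sketch` is hypothetical, the relationship matrix uses one common scaling (centered genotypes divided by twice the sum of p(1 - p)), the overall mean is taken as a plain average, and the ratio of error to genetic variance `lam` is fixed by hand here, whereas real software estimates the variance components from the data.

```python
import numpy as np

def gblup_sketch(M, y, lam=1.0):
    """Simplified gBLUP: one shared variance for all markers.

    M:   samples x markers matrix of additive codes (0, 1, 2), no missing values.
    y:   phenotype vector.
    lam: assumed ratio of error variance to genetic variance (hand-picked here).
    """
    M = np.asarray(M, dtype=float)
    y = np.asarray(y, dtype=float)
    p = M.mean(axis=0) / 2                    # allele frequency per marker
    Z = M - 2 * p                             # centered genotypes
    scale = 2 * np.sum(p * (1 - p))
    G = Z @ Z.T / scale                       # genomic relationship matrix

    alpha = np.linalg.solve(G + lam * np.eye(len(y)), y - y.mean())
    breeding_values = G @ alpha               # genomic estimated breeding values
    marker_effects = Z.T @ alpha / scale      # allele substitution effects
    predicted = y.mean() + breeding_values
    return predicted, breeding_values, marker_effects
```

In this formulation every marker contributes, and the breeding value of a sample is simply its centered genotypes multiplied by the marker effects. Bayes C-Pi differs in that it lets many marker effects be exactly zero and samples its estimates iteratively, which is why it is slower.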
And so a quick quiz question for you guys. In which example do you think it would be more beneficial to use Bayes C-Pi over gBLUP? And I'll give you a couple seconds to answer that one as well. OK, so it looks like you all guessed right. It is when you have variation being controlled by strong quantitative trait loci. Bayes C-Pi is known for being very good at predicting phenotypes in that scenario, and if you did have 1,000 samples, you may want to select gBLUP because it is a little bit quicker to get you your results.
But at this point we have covered which methods are available in SVS for genomic prediction and highlighted some of the similarities and differences between them. However, an important scenario that I want to briefly consider is when we have a set of data with known phenotypes and we want to use that dataset to predict phenotypes for future populations. How do we know how well our dataset will be able to accurately predict those phenotypes? And to address this question, we can use k-fold cross validation in our genomic prediction models. How cross validation works is, let's say in our dataset, we have 100 samples with known phenotypes and we want to use those 100 samples to make predictions for an additional 20 samples that we don't have the phenotypic information for. What we can do is we can break down our 100 samples into subsets. And in this case, we have five subsets or folds. And in each round, one fold is chosen as the test set, meaning its phenotypes will be treated as missing. And then the samples in the other folds are considered training samples, where their known phenotypes are gonna be used to predict the phenotypes of the test set. And ultimately what's great about this is all of your data ends up being used to predict the phenotype, but is also used for testing as well. And so k-fold is undoubtedly a desirable inclusion for your genomic prediction. But carrying out this process manually, subsetting all the data and selecting data for training and testing for each fold, can be quite time consuming and labor intensive. And as we're going to see, SVS makes performing k-fold much faster. And so let's go ahead and return back to our cattle dataset and take a look at some of the results from our genomic prediction.
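The manual bookkeeping behind k-fold cross validation can be sketched in a few lines of Python. This is a generic illustration, assuming the standard scheme in which each fold is held out once as the test set while the remaining folds are used for training; the function name `k_fold_splits`, the random shuffling with a fixed seed, and the use of sample indices 0 to 99 are assumptions for this example, and it does not stratify folds by a variable such as breed.

```python
import random

def k_fold_splits(sample_ids, k=5, seed=0):
    """Yield (training, test) sample lists, holding out each fold once."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)          # reproducible random assignment
    folds = [ids[i::k] for i in range(k)]     # deal samples into k folds
    for held_out, test in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != held_out for s in fold]
        yield training, test

# 100 samples with known phenotypes, split into five folds.
splits = list(k_fold_splits(range(100), k=5))
for training, test in splits:
    # A model would be fit on `training` and used to predict the
    # phenotypes of `test`, whose own phenotypes are hidden from it.
    pass
```

With 100 samples and five folds, each round trains on 80 samples and predicts the other 20, and after all five rounds every sample has been predicted exactly once.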
OK, so we're going to start again with our merged spreadsheet with our filtered genotypes and our phenotypic information, and we'll start by looking at gBLUP. But first, I do want to run through the parameters I use and then how you can set up genomic prediction runs. And so what we'll do is we're going to first get this sheet open. There we go. And what we're gonna do is we want to select our phenotype of interest, so in our case we're gonna be looking at phenotype five, and so you'll go ahead and go to the genotype menu again.
And if we scroll down here, we can see all of our options for our Bayesian methods, gBLUP as well as that k-fold cross validation for a genomic prediction. So we're gonna go ahead and select that one, since we definitely want to include that in our prediction. And here we can see the methods that we can choose from.
In my case, I did select both methods and then you can set your Bayesian options here like how many iterations you want to run. Additionally, you can correct for gender or any other covariates that you would like, and then also you can set up your k-fold cross validation here for the number of folds that you might want and the number of iterations as well. And you can also stratify your folds by a variable. And in my case, when I ran this, I used a breed. And so then we would go ahead and click OK to run this. And so again, this does take some time. So I have computed these results in advance. So let's go ahead and take a look at those now.
So you can see here our output from running gBLUP. We can see each fold within each iteration, and we can look at the results on a per fold basis through these spreadsheets here. But what I like to do is go ahead and jump to the final results of the iteration and take a look at those, where we can look at our actual phenotypes versus our predicted phenotypes. And so here in the spreadsheet, we can see for which sample and in which fold these predictions were made. And what's really great about this spreadsheet here is that we might want to look and see if we want to exclude any of our samples from our prediction in case they didn't do a good job at predicting. An example of that might be sample eight here. And so a better way to really look at this is to plot our predicted phenotype against our actual phenotype. And in this graph here, I also colored the data points by fold to just sort of give us a better idea of which predictions came from which fold. And something that we might want to look at here is any clustering or outliers, and coloring by fold is also helpful for that. But really what we're looking at is we want to make sure that this relationship is linear. And so it seems here that we do have some unexplained variation going on. And we can take a closer look at this in a minute by looking at our R-squared value to see if that's true. And also what we can take a look at with our sample statistics is our correlation coefficient. So let's go ahead and take a look at those summary statistics now. So here for our gBLUP run, it looks like our correlation coefficient is around .5, which isn't too bad. But also here we do see our R-squared value and this value is a bit low, which would explain why we might have some unexplained variance from our model, and why it was a little bit more scattered.
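The two summary statistics discussed here can be computed from the actual and predicted phenotype columns with a short Python sketch. This is an illustration under an assumption: R-squared is taken to be the square of Pearson's correlation coefficient, which holds for a simple linear fit of one column on the other, and the software's exact definition may differ. The function names and the toy numbers are hypothetical.

```python
import math

def pearson_r(actual, predicted):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def r_squared_from_r(actual, predicted):
    """Share of variance explained by a simple linear fit (squared correlation)."""
    return pearson_r(actual, predicted) ** 2

# Toy phenotypes: predictions track the actual values only loosely.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 2.0, 3.5, 3.0, 4.0]
r = pearson_r(actual, predicted)
r2 = r_squared_from_r(actual, predicted)
```

Under that assumption, a correlation a little below .5 and an R-squared of about .23 describe the same fit, since .48 squared is roughly .23.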
But an interesting question that I want to ask you guys at this point is: do you think that the R-squared value will be higher, lower or the same when we do our Bayes C-Pi run? And remember that Bayes C-Pi calculates a variance for each locus as opposed to keeping it a constant value. So go ahead and take a second to vote, and then we're gonna remember this value of 0.23 and take a look at the results of our Bayes C-Pi run.
OK, so it looks like everyone's thinking higher. So let's see if you guys are right. So let's go ahead and then look at our Bayes C-Pi results.
And so, again, the results are similar to what we get from gBLUP. We can see per fold results or we can take a look at those final results from our Bayes C-Pi run. And we can go ahead and take a look at that plot for our predicted phenotypes against our actual phenotypes and see if the graphs do look similar to each other. There we go. OK. So it does look similar to the run that we performed with gBLUP. But perhaps we do have a bit of clustering here from fold number two.
So that might be something that we would want to look into and investigate a bit more. But to really compare the two methods, we're gonna want to go ahead and look at those summary statistics and look at our correlation coefficient, but also R-squared value. So here we go. Open this up. Moment of truth.
Looks like our R-squared value is a bit higher. So you all did really well on that, and that would be because the Bayes C-Pi run does explain a bit more of the variance. So it makes sense that we would get a slightly higher R-squared value. But what we're also seeing here is that our Pearson's correlation coefficient is a bit higher, and so we might decide that running Bayes C-Pi is our preferred method specifically for predicting phenotype five. But again, we did see that the results were overall pretty similar. The last thing that I want to do to wrap up our story here is take a look at the SNPs that contributed most to predicting our phenotype five. And so to evaluate this we can look at the allele substitution effect, specifically the normalized allele substitution effect, which is the average across all of our folds. And so I plotted this in advance for both our gBLUP and our Bayes C-Pi run so we can compare them side by side.
And so we'll go ahead and take a look at those Manhattan plots now. What you might notice here is that there are a few differences. In particular, the gBLUP run seems to have more markers in it and is a bit noisier. But remember, this makes sense because we are including all of the loci in the model, whereas Bayes C-Pi is going to home in on the specific set of loci that are most ideal for the model. But here we can see all the SNPs that contributed most to phenotype five and where they're located in the genome. And if we go ahead and click on one of these, in the bottom left-hand corner here we can see the chromosome number and position as well as the name of the sample that that SNP is found in, as well as the genotypic information for that marker.
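The averaging across folds behind a plot like this can be sketched in a few lines of Python. This is only an illustration: the marker names and effect values are made up, the helper names are hypothetical, and the normalization step that SVS applies is not reproduced here. The sketch averages each marker's effect across folds and then ranks markers by the absolute size of that average.

```python
def mean_effect_across_folds(effects_by_fold):
    """Average each marker's effect over all folds.

    effects_by_fold: list of dicts mapping marker name -> effect, one per fold.
    Hypothetical helper for illustration, not an SVS function.
    """
    n_folds = len(effects_by_fold)
    markers = effects_by_fold[0].keys()
    return {m: sum(fold[m] for fold in effects_by_fold) / n_folds for m in markers}

def top_markers(mean_effects, n):
    """Return the n markers with the largest absolute mean effect."""
    return sorted(mean_effects, key=lambda m: abs(mean_effects[m]), reverse=True)[:n]

# Made-up per-fold effects for three markers across two folds.
effects_by_fold = [
    {"snp1": 0.10, "snp2": -0.40, "snp3": 0.02},
    {"snp1": 0.20, "snp2": -0.60, "snp3": 0.00},
]

mean_effects = mean_effect_across_folds(effects_by_fold)
print(top_markers(mean_effects, n=2))
```

Ranking by absolute value matters because a marker with a strongly negative effect contributes to the prediction just as much as one with a strongly positive effect.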
And so with this, we can definitely see which SNPs most strongly influenced the prediction for our trait of interest. That was a lot of information to cover. But hopefully this demonstration has shown you how easy it is to use SVS for efficient quality control analysis in GWAS, and also for genomic prediction with gBLUP and Bayes C-Pi, to get us down to those SNPs that are estimated to produce our predicted phenotype. And so at this point I am going to return to the slides and wrap things up, but also answer some of your questions as well. OK. First and foremost, again, I would like to acknowledge that we are very grateful for and appreciate the NIH grant funding that we received. But I also want to remind everyone at this point to input any final questions that you have into the chat panel, and we can answer them as best we can now.
And while you are submitting your questions into the question pane, I want to mention that Golden Helix will be at the Plant and Animal Genome conference this January. I will be there, and I will be demoing some other workflows within SVS, so if you're going to be there and would like to see SVS up close and personal or ask us any questions, please feel free to come over and stop by. Now I will turn things back over to Delaina, and she will provide some more details on PAG as well as some other housekeeping items.
Yeah. Thank you, Julia. That was a great presentation. As she mentioned, our team is looking forward to heading down to San Diego, California for the PAG 2020 conference in just a month.
We are looking forward to connecting with everyone who will be attending. So I'll just run one last poll here to see how many of you in the audience today will be attending. As you're filling that out, I will mention that at the conference we will be hosting a variety of demos given by Julia in our booth, booth number 231, and we cannot go without mentioning our infamous T-shirts, which will be there as well. We will be handing those out, but please keep in mind that attending one of our demos is your ticket to taking one of those T-shirts home. So stop by; they're about five minutes long and a great insight into SVS. And then secondly, for all of our viewers today who are considering SVS, or maybe some of you who could add a few additional licenses for your team, now is an excellent time to do so. This year we brought back our very popular end-of-year bundles, which are a variety of different multi-user software packages at heavily discounted prices. Listed on your screen here are two of our SVS-related packages that we are offering this year. The only caveat for these bundles is that each of them is limited in number. So if there is one that you like, you'll have to buy it before they sell out. I just sent you all a link to the bundles on our site so you can see what remains and go from there. All right. So that's it for events from me, Julia. We can go ahead and move into the Q&A if you're ready.
First question is, is there any limitation to the number of markers or samples that can be imported into SVS?
That's a really good question. Not really, no. SVS is designed to handle the massive datasets that are necessary for increased power with association tests. So we have a lot of users that are handling thousands of samples and millions of markers for association tests, large-N studies and even imputation. But our FAS team is always available to help troubleshoot the import process with you.
Perfect. And second question: what if we are working with a species that has no current reference assembly? Can we still perform association tests in SVS?
Yes, you can. Though you may not be able to map the markers or plot the various annotations, you can still perform basic association tests to identify any of those significant markers, even without application of a marker map.
Great. Well, unfortunately, we are running out of time, we're at the top of our hour, so we will not be able to get to everyone's questions today. But there are a number of you who did ask a question, so we will be reaching out with answers to all of those via email. And additionally, we will be sending out a link to today's recording, which also includes the slides. So keep an eye out for that sometime this week. Thank you, everyone, again for attending today. And thank you, Julia, for this great GWAS and genomic prediction presentation.
Hope everyone has a good rest of your day.