Whole Genome Trait Association in SVS

About this webinar

Recorded On: Wednesday, July 15, 2020

Golden Helix’s SNP & Variation Suite (SVS) has been used by researchers around the world to do trait analysis and association testing on large cohorts of samples in both humans and other species. As Next-Generation Sequencing of whole genomes becomes more affordable, large cohorts of Whole Genome Sequencing (WGS) samples are available to search for additional trait association signals that were not found in array-based testing. In fact, recent papers have shown that WGS analysis using advanced GREML (Genomic Relatedness Restricted Maximum Likelihood) techniques is able to outperform microarray-based GWAS methods in the analysis of complex traits and the proportion of trait heritability explained.

Our latest update release of SVS has expanded the existing maximum likelihood and GRM methods to support these new techniques. We have also enhanced various other association testing and prediction methodologies. This webcast showcases:

  • Newly supported analysis workflow for whole genome variants using LD binning and enhanced GBLUP analysis
  • Enhanced gender correction using REML
  • Additional capabilities for genomic prediction and phenotype prediction

We are continually improving our products based on our customers' feedback. We hope you enjoy this recording highlighting the exciting new features and select enhancements we have made.

Watch on demand

Please enjoy this webcast recording. Should you have any questions about the content covered, please reach out to our team here.

Download the slide deck

To download a copy of the slides, click on the LinkedIn icon. This will redirect you to the SlideShare site. From there, you can clip your favorite slides or download the entire deck to your computer.

Love this webcast? Check out more!

Find out how Golden Helix software enables users to harness the full potential of genomics to identify the cause of disease, improve the efficacy and safety of drugs, develop genomic diagnostics, and advance the quest for personalized medicine.

Transcript

 

*Please note that you may experience errors in the transcript below; therefore, we recommend watching the video above for full context.

 

Hello, everyone, and thank you for joining us for today's webcast presentation. My name is Delaina Hawkins, Director of Marketing and Business Development, and it is my pleasure to be today's moderator for this presentation of Whole Genome Trait Association in SVS. Joining me today is our presenter, Gabe Rudy, our VP of Product and Engineering. Thank you so much for taking the time to be here today. How's it going?

 

It's a pleasure to be here.

 

Glad to hear it. Before I pass things over to you, I just want to remind our audience that we will be accepting questions throughout this entire webcast. On the screen here you can see where the questions tab is in your GoToWebinar panel. And that is where you will submit your questions. And for anyone who is curious, these will be anonymous. So hopefully no hesitations there.

 

That's it for me, Gabe. I will go ahead and pass things over to you and let you get started.

 

All right. Fantastic. Thank you, Delaina. So what we're going to be talking about today is some capabilities, and this headline feature in our upcoming release of SVS, around Whole Genome Trait Association. A very exciting feature, and of course I want to get to that. But I also want to make sure to give you a proper background about the company and some other things about SVS. So first and foremost, we recently received some grant funding from the NIH, for which we are incredibly grateful. The research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under the listed awards. Additionally, we are grateful to have received local grant funding from the State of Montana. Our PI is Dr. Andreas Scherer, who is also the CEO of Golden Helix. The content described today is the responsibility of the authors and does not officially represent the views of the NIH. So again, we're thankful for grants such as these; they provide huge momentum in the development of quality software here at Golden Helix. So let's just spend a minute going over a little bit of the background of Golden Helix, who we are, what we do, and the context of the method and the product it's going to be displayed in here.

 

So as context here, we were founded in 1998 as a bioinformatics company, and a little bit about our history and background: we're one of the few bioinformatics companies that can say we have spanned, you know, 20 years of experience building solutions in the research and clinical space. The solutions have a very broad range of capabilities, and obviously we're not covering all of those here.

 

They do cover a lot of different use cases in the research and clinical markets, including the ones we're going to talk about today related to Whole Genome Trait Association. If you're interested in these other capabilities, please visit our website. We have recorded webinars, of course, as well as lots of other material for you to investigate on your own.

 

We've been cited in over 1,000 peer-reviewed publications, and we are always happy to celebrate the success of our customers in performing their research with our tools. In fact, if you go to our blog, you'll often see us highlighting some of that success.

 

So these customers span many different industries, from academic institutions to research hospitals, commercial testing labs, and government institutions. We're talking about, you know, over 400 customers, and this really does span the globe. It's not uncommon for somebody on our team focusing on training and support to have a call with somebody in Germany in the morning, then Australia in the afternoon, and India at night, and all these types of things. So it really is a pleasure for us to serve a wide spectrum of customers in a global space.

 

So as a company, our software is being vetted by this large and established customer base. We've had many years to mature our capabilities, and we're always incorporating feedback from customers and their diverse set of needs across all the different use cases. We are very integrated in the scientific community, providing content in the form of ebooks and webinars such as this one. We also participate in groups such as GA4GH that are working to establish standards for data sharing and data representation. And one example of that content is our recent series of webinars on COVID-19.

 

That series covers a lot of detail about analysis workflows to support cluster detection, strain analysis, as well as tracing infection chains and so on. So I encourage you to go look at that. We also have a bit of an update on that topic, which I'm going to cover at the end of this webinar in a couple of slides. So as a business, we want to be aligned with the success of our customers. One way we do that is by providing software on a simple per-user subscription model that comes with unlimited training and unlimited support. We're not tracking or charging based on the number of samples you analyze or how much data you process. We're really invested in supporting the research of our users and the clinical tests that they're setting up, and we want them to have the success of running as many samples as possible through our software in their work. So, you know, overall, we're an established and trusted business in this space. It's not hard to find someone to talk to who's used our solution if you want a reference. We're always listening to our customers to improve our solution and engaging with them, and that feeds into an iterative process. Part of what you're seeing today, where we're talking about the upcoming release of one of our products, is an effort not only on the research side but also in bringing in all this feedback that we gain over time from our active user base.

 

So let's talk a little bit about SVS. This is our suite of capabilities that is really focused around analyzing large numbers of samples, but we are actually species agnostic. So, as I mentioned, we recently had some webinars about doing virus analysis on SARS-CoV-2, and we're going to do a little update on that. In particular, our CEO has been published in a number of papers and articles describing how SVS can be used to analyze and trace whole genome samples of the SARS-CoV-2 virus. Of course, we're also able to analyze large numbers of samples in agrigenomics contexts like plant breeding programs. And, of course, model organisms and humans, including different builds of the human reference genome if you're aligning to newer versions, et cetera. And we support bringing in data from any number of platforms: microarray platforms, next-generation sequencing. We've really seen, through these many years, just about every file format that is produced in this field. We help you get that data in, cleaned up, manipulated, joined up, and ready for the type of testing and research work that you're doing. And we have fantastic integrated visualizations. This is really one of those features that people just grow to love inside of our tools. Whether you're using SVS or VarSeq, we have this integrated capability, which we call GenomeBrowse. It's also available as a standalone tool, but it displays genomic information right alongside your research data, your clinical data, your own data. It brings in annotations from the outside, displays things like VCF files and BAM files, and puts the results into a genomic context that's easy to navigate and visualize. And then we also support a number of different research workflows. This covers things like trait association, population analysis, genomic prediction, rare variant analysis, rare CNV analysis, and complex analysis, where you might be tracing a more complex, dominant-model disorder across, you know, a huge family cohort, or if you're just doing cohort analysis in general. So SVS is the platform for these types of association tests, population analysis, looking for clusters of samples, and looking for complex traits across samples.

 

So that's sort of the core functionality of SVS, and there are a lot of additional capabilities that you're welcome to add to your package as well. I just want to review those; sometimes there are so many things that people don't necessarily know all the capabilities of the platform they've been using. One thing I want to mention is we developed a fantastic NGS variant caller, and we did this with one of those SBIR grants that we mentioned at the beginning. We did this also to support clinical testing labs who are doing NGS calling on things like their targeted gene panels or their targeted exomes. But we brought this capability into SVS as well, since researchers often have access to NGS data, whether it's targeted gene panels, exomes, or whole genomes, and we have different strategies for those use cases. We, of course, still support CNV calling based on microarrays and the type of analysis that follows downstream of that. We also have fantastic variant annotation and filtering capabilities: annotating against common variants, being able to classify variants as loss-of-function, missense, or synonymous, and similarly a set of capabilities for annotating and filtering CNVs. We have a specialized tool for ranking the genes that overlap your variants and CNVs based on phenotypes defined by the Human Phenotype Ontology, or HPO terms. And we have a couple of other capabilities as well, things like meta-analysis across studies and genotype imputation, where you might have an issue trying to bring together microarrays that are on different platforms and want to impute the nearby SNPs to get them all harmonized, or go from microarray to NGS-level variant density. These are all things you can do right within SVS.

 

Now, let's touch a little bit on some of the things we're going to talk about today, in the context of the development of the standard numeric analysis and variant analysis methods. This is sort of bread-and-butter SVS stuff. So, again, over 20 years of experience, we've seen the research on genotype association testing advance as well. Part of what we're talking about today is a very new progression of that advancement of methods over time, but it's also helpful to see the lineage of these strategies and how all these capabilities are still available in SVS. So, first of all, we have principal component analysis, and we're going to mention this again in the context of COVID-19. But this is useful in any context to be able to understand the underlying structure of a set of samples. All samples in a population are related at some level, and you can actually pick out some of that relatedness by using things like principal component analysis. It reduces the dimensionality of a very large matrix to a few components that you can visualize and graph, such as this one down here. In our standard GWAS-type workflow, so Genome-Wide Association testing, you're generally comparing a binary or continuous variable against a single marker at a time, and then each of those tests produces a statistic like a p-value. You might then just look for that statistic having a peak, something like this up here, and then you know there's an association between that genomic region and a trait. That's something that SVS has always done, and it does it very well. It has a number of different capabilities for multi-marker correction; you can correct for PCA components, all that kind of stuff. But as you get into rare variants, you have to try to account for the fact that individual rare variants don't have enough alleles present to be associated strongly with a trait that occurs in a large number of samples. So you try to collapse those rare variants into groups and bins; we support that. We also support family-based tests. And then, what we're talking about today starts to get into this type of analysis called mixed linear model analysis, where you might look at multiple markers at a locus, as well as other ways of correcting for population structure beyond principal component analysis. We can also look at pairwise relationships between samples using methods like IBS and IBD (Identity by State and Identity by Descent), as well as being able to compute genomic relationship matrices. And for standard single-sample analysis, or even when you get down to a set of candidate variants and want to understand how those variants impact genes, we have fantastic methods for that built into SVS as well, including gene impact predictions, being able to produce HGVS descriptions of variants (the type that you would put into a clinical report or a research paper describing a variant's impact on the gene), splice site predictions, functional predictions, conservation scores, and, of course, general filtering and other types of annotations that you can run on the variants themselves or their genotype states. All that flexible capability. So let's dive into a little more detail on mixed model capabilities, because this is really the background for the new method that we're describing today. OK.
So the first wave of GWAS papers was based on this single-marker association testing strategy, where they corrected for population structure by incorporating principal components as covariates into, let's say, a linear regression. This turns out to have a number of problems. There's a bit of a subjective nature about how many principal components you put into your model. You can start putting in too many, and you're actually normalizing out part of the signal, because the principal components don't know whether they're picking up population structure or actual structure of the trait itself; they're just picking up whatever they can find in high-dimensional data. So that's one problem. But there's also the matter that you're picking an arbitrary number of principal components; you're not necessarily incorporating a true relationship matrix as a component of your analysis. That's what the genomic best linear unbiased predictor (GBLUP) style strategies do: we can create a matrix of relatedness between samples, which we call a kinship matrix, and we can incorporate that.
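To make the kinship matrix idea concrete, here is a minimal numpy sketch of one common way to build a genomic relationship matrix from 0/1/2 genotype counts (the VanRaden formulation; a generic illustration under that assumption, not the SVS implementation):

```python
import numpy as np

def vanraden_grm(genotypes):
    """Genomic relationship matrix in the style of VanRaden (2008).

    genotypes: (n_samples, n_markers) array of allele counts coded 0/1/2.
    Returns an (n_samples, n_samples) relatedness ("kinship") matrix.
    """
    p = genotypes.mean(axis=0) / 2.0         # per-marker allele frequency
    Z = genotypes - 2.0 * p                  # center by expected count 2p
    denom = 2.0 * np.sum(p * (1.0 - p))      # scales the diagonal toward 1
    return (Z @ Z.T) / denom
```

The resulting matrix captures the pairwise relatedness of all samples, which is exactly what gets incorporated into the mixed model as a random effect.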

 

So this is what a GBLUP analysis can do, and it uses a restricted maximum likelihood method, which we abbreviate to REML, to estimate the genetic variance explained by the genotypes. We call these per-genotype outputs the allele substitution effects, but you can kind of think of them as the ability of a genotype to predict the analyzed phenotype, whatever you have as your phenotype, which in this model is a fixed effect. We have two REML algorithms implemented in SVS; the one we're talking about today is called the average information algorithm. What's special about it is that although it can incorporate a genomic relationship matrix into the model, it can actually incorporate more than one.
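For reference, the mixed model behind this kind of GBLUP/GREML analysis is conventionally written as follows (standard textbook notation, not SVS-specific output):

```latex
y = X\beta + \sum_{i=1}^{k} Z_i u_i + e,
\qquad u_i \sim N\!\left(0,\; G_i\,\sigma^2_{g_i}\right),
\qquad e \sim N\!\left(0,\; I\,\sigma^2_e\right)
```

Here $y$ is the phenotype, $X\beta$ holds the fixed effects (intercept and covariates), each $G_i$ is a genomic relationship matrix with its own variance component $\sigma^2_{g_i}$, and REML estimates the $\sigma^2_{g_i}$ and $\sigma^2_e$. With $k = 1$ this is ordinary single-GRM GBLUP; the average information algorithm allows $k > 1$.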

 

We're going to use that strategy. We implemented it to support things like gender correction, as well as being able to correct for gene-by-environment interactions, but this new method is going to use the same multi-matrix strategy. We also have a bivariate GREML method that estimates the genetic correlation between two traits when looking at all of your genomic data. The next method in this list here is the GWAS mixed linear model analysis algorithm. There are a couple of different modes for this algorithm. One is just a straight linear regression, in which case it's basically just a regression. More interesting is that it allows you to do association testing with either a single-marker or a multi-locus mixed model, using an EMMAX implementation.

 

In this case, the dependent variable is modeled with fixed effects, and the kinship matrix, your genomic relationship matrix, is included as a random effect. The multi-locus form takes an iterative approach, finding genomic markers to include as additional fixed effects in the model.

 

The bottom image here gives you an idea of the output of this approach, where you have this cumulative work done, and at each step you can see the progress of the stepwise approach. This is also a nice visualization of what these genomic relationship matrices can look like. You can see there's a numeric representation describing the relatedness between samples, with samples on the left and samples on the bottom. The lower half of this triangle mirrors the upper half, because the matrix covers this n-by-n pairwise relationship. Other features that support using a GRM in SVS include Bayes C and Bayes C-pi, which are often employed for genomic prediction. So those are our standard capabilities. We've already had these extremely popular capabilities, and these days they're generally used instead of single-marker trait association methods because they do such a good job of correcting for population structure.

 

But this is where this new method takes us: it gets us into an even more advanced approach. To motivate why we are taking another approach beyond what we just saw, let's talk a little bit about the problem that's called the missing heritability problem in genomics. So what do we mean by heritability? Heritability is the proportion of a phenotype attributed to genetic factors. And you don't even have to look at a genome to figure this out. You can look at monozygotic twins, people who have the exact same genome, and examine a trait, say height or body mass index. By looking at many such pairs, or by analyzing other types of family data structures, you can get a very good estimate of how much of that trait is explained purely by genomic content versus environmental factors, the way people are raised, and all the other components that affect physical attributes. The problem is, when we look at the single-marker tests that were done in the GWAS era, they do not explain all of this estimated heritability. Their estimated heritability is actually only a small proportion of the total estimated heritability of complex traits. This really puts a cap on how much you can take the types of predictions you get from a GWAS study and apply them to the population as a whole, let alone to an individual. It gets even more tenuous when you try to take these things that have very small effects and apply them to a single person. But that is, in essence, what a lot of these direct-to-consumer tests like 23andMe are trying to do. It can be a lot of fun, and some traits are actually very useful to understand, like having an increased risk of this or that. But a lot of the traits being predicted through a platform like 23andMe are missing a lot; you have to assume they're missing other parts of the genomic content that would explain more of the trait, because they're only looking at common SNPs, SNPs with very large allele frequencies in the population, which can be predefined on the arrays that are run very cheaply on the spit you put into a vial. So where is the missing heritability? A promising hypothesis is that if you were to look at the rest of the variants in the human genome (common variants being, of course, a small subset of the total variants you would find in any whole genome), then in aggregate those would potentially account for the rest of it. But how do you test that? How do you put that into a computational model? It's been very difficult to do. But this new paper demonstrates a strategy that achieves almost complete recovery of the estimated heritability of complex traits, the two in this paper being height and BMI. So here's the paper on the right, and I'll give you a little bit of background on how it works at a high level. It's not super complicated, but it builds on a lot of the complicated things we already had in SVS.
So we're able to leverage these years of effort building these very sound models, implement this new strategy on top by expanding some of the capabilities we have, and then automate a lot of steps that were essentially there but would take a lot of time to do by hand. Whole genomes, of course, have this massive spectrum of allele frequencies, and the approach here is to control for population structure while accounting for the fact that there is probably a different structure at different bins of these allele frequencies. Think of it this way: common SNPs are inherited from much further back in the ancestral tree, whereas rare SNPs come from a much smaller branch of your ancestral tree, probably because they're private to your own lineage. So this new strategy is essentially this: you take your variants and bin them based on their minor allele frequency. They also found it helps to include linkage disequilibrium (LD) as another component of binning. By that they mean, for a given variant, how high is it in LD with its neighbors? Essentially, how much of its state is generally shared across all samples, meaning there's a high likelihood that it and its neighbors are inherited in the same inheritance chunk. And the nice thing is that MAF and LD do not seem to be directly correlated, so together they are very good at separating markers into different groups. Each group of markers then gets its own genomic relationship matrix computed, capturing slightly different components of the relatedness of the samples, and you can put all of those multiple genomic relationship matrices into a linear mixed model analysis. And here's the paper, with this giant list of people from different groups, including people who do a lot of the method development as well as people who run the giant cohort studies. The samples brought in for this paper were from TOPMed, which is a fantastic, very large whole-genome sample collection effort that is still ongoing, though they're starting to do research on it. And the people doing the method development have done a lot of the popular methods we talked about today; they work on those methods and those packages. So this is a combination of all the advancements of those. There is a second paper that follows up on this, on solving the missing heritability problem, which describes this method as GREML-WGS. So that's what we've implemented. Now, here's the callout: by combining these two things, MAF and LD scores, to define the bins, they obtain an estimated heritability in their genomic models of 0.79 for height and 0.40 for BMI. And that is almost exactly what you get from the pedigree studies, where you look not at the genomic content but at how much of these traits is shared between people with the same or similar genomes. And here's another interesting thing. Again, we're wondering, where's the missing heritability? Well, 50 percent of the heritability of height and BMI in these models is explained by variants with a minor allele frequency of less than one percent.
Of course, we're talking about minor allele frequency because you have enough samples here that you can just look at which allele is minor in the population. In next-generation sequencing, we often talk about alternate allele frequency because we're comparing to the reference and looking at the alternate allele; the two are roughly interchangeable. But in this case, since we're working at the population level, we're talking about minor allele frequency. So here's how this is implemented in SVS. We have a couple of new capabilities, and in the next step we'll actually run through them. We also have a number of enhancements to our existing capabilities, including things we've been adding over time as users asked for this or that output, but I wanted to cover all of this in one go. The first thing we're going to do is create an LD score for every marker. We can also, obviously, compute a minor allele frequency for every marker. Then we create bins that combine these two things. Essentially, for everything that's in, say, minor allele frequency bin one, we then bifurcate that bin (or you can choose how many sub-bins you want) based on the LD score. You use that as a second level of splitting: split first on minor allele frequency, then split second on LD score. Then we take all of these bins. Every marker is assigned to exactly one bin, so each bin has a unique set of markers from the genome. And these are distributed across the genome; it's not like a bin's markers are all on one chromosome or something like that. Each bin contains markers from across the entire genome. Each bin then gets a genomic relationship matrix computed, they all go into the analysis, and we get random effects based on each GRM computed for each sample, plus allele substitution effects computed for each marker. So we have this per-sample output and this per-marker output. Again, these are very useful in understanding the relationship between the selected phenotype and its ability to be predicted by markers, and the ability of each sample to have its phenotype predicted, essentially. GBLUP analysis has also been improved in general. We've added the ability to predict missing phenotypes in all cases; it used to be possible only in a subset of cases, and we expanded that. We also now output the intercept and the fixed-effect beta values, which includes those for any additional covariates you might add. We also have other useful outputs now for the many cases where the data only partially fit the model. As you can imagine, because this is an iterative process that generally converges on a result, sometimes there is essentially no strong trait association going on at all, so the model has a hard time reaching convergence, and at a certain point we basically have to stop and show you what we came up with instead. That gives you a lot more usable output to understand what happened, as opposed to essentially not giving you anything. And then, of course, there have been various other improvements to the dialogs. We're trying to make this tool robust across use cases and have useful error messages.
A lot of these are based on user requests.
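Before we get to the demo, here is a minimal sketch of the two-level binning idea just described: split markers on MAF first, then split each MAF bin at LD-score quantiles. This is a generic numpy illustration with a hypothetical helper name, mirroring the GREML-WGS binning concept rather than the SVS dialog or API:

```python
import numpy as np

def maf_ld_bins(maf, ld_score, maf_edges=(0.1,), n_ld_bins=2):
    """Assign each marker to one combined MAF x LD bin (hypothetical helper).

    maf, ld_score: per-marker arrays.
    maf_edges: interior MAF cut points (0.1 puts MAF < 10% into bin 0).
    n_ld_bins: sub-bins per MAF bin, split at LD-score quantiles.
    Returns an integer overall bin id per marker; every marker lands in
    exactly one bin.
    """
    maf_bin = np.digitize(maf, maf_edges)            # first-level split on MAF
    overall = np.zeros(len(maf), dtype=int)
    for b in np.unique(maf_bin):
        idx = maf_bin == b
        cuts = np.quantile(ld_score[idx],
                           np.linspace(0, 1, n_ld_bins + 1)[1:-1])
        ld_bin = np.digitize(ld_score[idx], cuts)    # second-level split on LD
        overall[idx] = b * n_ld_bins + ld_bin
    return overall
```

With `n_ld_bins=2` this bifurcates each MAF bin at its median LD score, which is the same spirit as the bifurcation option shown in the demo that follows.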

 

So with that, I'm just going to pop open SVS right on top here, and we're going to go through a couple of these capabilities from my project. Let me make this full screen, and I'm going to open up this project. And, of course, I would love to do this on a whole genome project with millions of variants and all that kind of stuff.

 

But I don't have time; SVS scales very well to that type of thing, but I created a very small project that I can run through in a couple of minutes, since I'd like to get to a couple more things here, including, as I mentioned, a little bit of an update on COVID-19 and SVS. So we have just a very simple binary phenotype here, and we have genomic markers spanning the whole genome. Even though we only have 1,400 markers, to keep this simple, these do reflect markers across the entire genome. I'm plotting this in GenomeBrowse, this fantastic visualization tool, so you can see that across my 700 or so samples, I have data across here, but these are mostly common SNPs, as we'll see in a second. The first capability we're going to run through is this binning capability, and then we're going to run the GBLUP that incorporates the bins. The binning capability is available in this new tool called LD Score Computation and Binning. Now, if I don't choose to create bins, it will first just compute an LD score. Let's do that so we can see the nature of what we have in our current dataset. I have an output here that includes LD scores and minor allele frequencies for every marker, my 1,400 markers in the genome. I can expand this out to see where those are in the genome, et cetera. Let's create a histogram of these two things. I'm going to look at the histogram of my minor allele frequencies, and I'm going to expand the number of bins so I can get a sense of what's going on here. As you can see, I have basically the same number of markers at, you know, 10%, 20%, 30%, 40%. This dataset does not have a whole lot of rare variants, so we're not going to get crazy and start creating bins for 1%, 0.1%, 0.01%. But we could, and that's actually something you'll see in the dialog when we get back to it. What we have is mostly a spectrum of common variants, but with a good separation of variants across the minor allele spectrum. Let's see if the variants are also separated based on their LD score. Yeah, sure enough, we have a pretty good separation here as well. It looks like a lot of markers have very, very low LD, so they're not related to their nearby SNPs, but then there's this long tail of relatedness. If you were to bifurcate this, the split would probably land around 0.1 or something like that, but that's going to be up to the algorithm to do for us. So all we need to do now is go back to our spreadsheet and rerun our binning step.
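As an aside on what those per-marker LD scores represent: one common definition is the sum of squared correlations (r-squared) between a marker and its neighbors within a window. Here is a minimal numpy sketch of that idea under simplified assumptions (a fixed marker-count window rather than a kb or cM window, and naive r-squared), not SVS's exact computation:

```python
import numpy as np

def ld_scores(genotypes, window=50):
    """Per-marker LD score: sum of squared correlations with markers
    inside a fixed neighbor window (simplified sketch).

    genotypes: (n_samples, n_markers) array of allele counts 0/1/2,
    with no monomorphic markers (std > 0 for every column).
    """
    g = genotypes - genotypes.mean(axis=0)
    g = g / g.std(axis=0)                    # standardize each marker
    n, m = g.shape
    scores = np.empty(m)
    for j in range(m):
        lo, hi = max(0, j - window), min(m, j + window + 1)
        r = g[:, lo:hi].T @ g[:, j] / n      # correlations with neighbors
        scores[j] = np.sum(r ** 2)           # includes the marker itself (r^2 = 1)
    return scores
```

A marker in a long shared haplotype accumulates a high score; an isolated marker stays near 1, which matches the long-tailed histogram we just looked at.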

 

I'm going to ask it to create the bins for us this time.

 

And I only want to start at 0.1, so 10% is my first bin boundary, and basically everything less than 10% goes into one bin. But you can go into smaller and smaller bins based on the allele frequencies you have in your project. Then I'm also going to bifurcate each of these bins by the LD score, and let's see if we have a pretty good separation of our markers. So now I have the same data, but with these bins defined. I have what would be my MAF bin, my LD bin, and then the overall bin. I can see the distribution of this by right-clicking and asking for the value counts. Of my 1,400 markers, you can see that most of these bins have at least 100 markers or so, which is good. You don't want to have too few markers, or you're not going to have a good relationship matrix built off of them. So this is good; I think we have a good separation here in the number of bins, number of markers, et cetera.

 

So that's good. We have this defined.

 

The next step is to run our GBLUP analysis with this new capability that uses the bins. The first thing it asks us for is which spreadsheet contains those bins. This is where, again, SVS is very flexible: if you wanted to define your bins using any other method, go for it. There are plenty of other capabilities in SVS to generate a separation of genotypes by other criteria; you could use annotations against things like 1000 Genomes allele frequencies, et cetera. But I'm going to choose the LD scores overall bin as my binning mechanism. The other thing is we're going to compute our genomic relationship matrix for each of these bins, on the fly. If you're going to run this analysis multiple times, you only have to do that once, because those bins are then defined; you could then change, say, your covariates or some of the other numeric options you're running with. That's all fine. We're not going to have any covariates in this analysis; it's really straightforward. We'll let it compute the bins for us, and away it goes. So we're computing genomic relationship matrices, one per bin, so we should have about 10, I think. Here they are starting to fly up, and then the GBLUP analysis runs, incorporating all of those into a model. I had selected a case/control phenotype here. Now, again, this is a bit of a simulated phenotype, so I don't expect a lot of very interesting things to be going on, but we should be able to see the type of output we're expecting: something on the sample level and something on the marker level. There's an extra kind of debugging output that reflects how each of these has a variance-covariance matrix associated with it. Let's look at this: this is the estimated marker output; I'll come back to that in a second. One thing we do when we output these things is give you a lot of summary information in what we call the node change log. All of these spreadsheets have the same information here, and this is also where you get some of those overall statistics. What we're actually looking at here is the variance explained by each of our genomic relationship matrices, and then the sum of all of those here. This ratio here is actually the estimated heritability of the trait that we created. So if I were running this with height or BMI, I might be able to see how much of my trait is estimated to be explained by the genomic data in my spreadsheet. In this case, we have a very low 0.15, but, you know, forgive me, it's a simulated phenotype. Before we jump into all the results, I do want to look at these spreadsheets. These are the genomic relationship matrices for the bins: bin zero, bin one, et cetera (we're computer scientists, so we start at zero). They should each be a little bit different, because they're built from completely different markers, and the markers have different properties: some are more rare, some are more common, some are in high LD, some are in low LD. So let's take a look. If I display this relationship as a matrix, you can start to see, yeah, there's definitely a family pattern going on here; some of these samples are related to others, some of them less so. There's some striping, as expected. OK, so that's interesting.
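For reference, the heritability ratio reported in the change log a moment ago corresponds to the standard GREML variance-component estimate: the genetic variance summed across the k bin-specific GRMs, over the total variance:

```latex
\hat{h}^2 \;=\; \frac{\sum_{i=1}^{k} \hat{\sigma}^2_{g_i}}
                     {\sum_{i=1}^{k} \hat{\sigma}^2_{g_i} \;+\; \hat{\sigma}^2_e}
```

In the demo this came out to 0.15; for the height and BMI analyses in the GREML-WGS papers, the corresponding estimates were 0.79 and 0.40.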
Notice how checkered it is; there are a lot of individual red dots inside this region. Let's go look at the next genomic relationship matrix, the one for bin one. It's built from a different set of variants, so it's very different, although you can see the overall structure is the same; the intensity of each value differs, and different values are involved. This is what you expect: different things are being picked up by each of these GRMs, but they're all being put into the analysis. And then finally, if we look at our estimates, our sample-level estimates and our marker-level estimates, we can look at these allele substitution effects. We can actually plot these in genomic space. It can feel like you're looking at p-values, though of course they're telling you a different thing: essentially how much of the trait is predictable by this individual marker, and how much the prediction would change if you substituted the allele; that's what the allele substitution effect means. But you do get to see this nice Manhattan-plot-style plot here, and if I had more markers, you could color it by chromosome and all that kind of good stuff, too. Anyway, this is the type of output that you take forward to building predictive models or doing other types of things: understanding those associations, the genes they cover, et cetera. Back to our slides here. That was a bit of a demonstration of this capability. It's fun to run through quickly; of course, it's a very different experience when you're running this with your own data, where there can be a little more time associated with running it. But there are also a lot of chances to stop and take a look at the data like I was doing, look at the histograms, and understand what's going on. That's part of SVS's unique ability to be an interactive, visual, exploratory environment, where you really get to control things and visualize the results at every single step. There are other capabilities that have been added to SVS in this new release that I want to touch on. We're always adding things, and this is part of what you can expect by having a subscription to any of our products: interactive support, and of course, when something doesn't quite match an existing capability, we can improve it and add things. One thing we've added is the ability to predict phenotypes from existing results, and to have that work on GBLUP output like we just saw. So that's very exciting. We also added the ability to support liftover during import of a VCF file. So if you have some older data that's based on GRCh36 and you want to import it and use it with all the latest annotations on 37, we fix that for you on import using these liftover chains. Similarly, you can go from 37 to 38, or even back from 38 to 37; we don't care. We also support importing the family data structure that SVS uses during the import wizard, so SVS capabilities that are aware of the relationships between parents and children, the family ID, and which samples are in which families can be set up there instead of in a separate step. It saves you a little bit of time and makes a lot of sense as you're going through our import wizard. And of course, there are so many updates to our annotations, too many to list here, but we're always doing this.
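To illustrate the phenotype prediction idea mentioned above, here is a minimal sketch of how a phenotype estimate can be formed from GBLUP-style marker output (a hypothetical helper in generic numpy; the actual SVS prediction feature is driven through its dialogs, not this code):

```python
import numpy as np

def predict_phenotypes(new_genotypes, allele_effects, intercept=0.0):
    """Genomic prediction sketch: score samples by multiplying their
    allele counts (0/1/2) by the estimated allele substitution effects
    and adding the model intercept.

    new_genotypes: (n_samples, n_markers); allele_effects: (n_markers,).
    """
    return intercept + np.asarray(new_genotypes) @ np.asarray(allele_effects)
```

This is also why the newly output intercept and fixed-effect beta values matter: they are the pieces you need to carry a fitted model over to samples with missing phenotypes.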
You can look at our blog, and every once in a while you'll see us tell you about some new capabilities and updates to some of our advanced annotations, part of our clinical variant annotation suites; we have things like OMIM and CADD. And across the board, lots and lots of annotations are being updated all the time, along with other small improvements. So, as promised, I wanted to give you a quick update on how SVS is being used, very topically, in analyzing SARS-CoV-2, which is, of course, the virus responsible for COVID-19. First of all, the analysis strategy here is, again, to understand the population structure, much like we were talking about with these genomic relationship matrices. But in this case we really aren't concerned about removing association with traits or anything like that; we really just want to capture the majority of the population structure, and PCA is a fantastic tool for doing that. So the strategy is to import the variants called from whole-genome sequences of SARS-CoV-2. This is available in some public repositories; a lot of people are sequencing these whole genomes, there are kits to do this on NGS sequencer machines, and there are ways to share this data back to public repos, so you can grab it. Then we normalize and bring in that data, do sample-level QC and variant-level QC, and drop things that are poor quality. From there you can do some fantastic PCA analysis that gives you, by overlaying things in color, a good sense that there are these clusters and strains evolving in the virus as it spreads across the globe. This is actually a picture across three PCA dimensions, showing you that rich dimensionality being picked up; this plot was produced in SVS from the PCA analysis done in SVS. We also support hierarchical clustering. And of course, if you're digging into how a given variant might impact the virus, or just trying to understand where variants occur and what types of variants occur in different strains, we have annotations, we have conservation scores, we have descriptions from UniProt about these proteins, we have the reference sequence and the gene models, and we can also annotate against the public samples. So if you have your own samples, you get a sense of which mutations are common, which other samples they have been seen in, and where those samples were observed, whether in Utah or in Germany, all this kind of stuff. All of that is available in SVS. So please reach out to us if you want a demonstration of how this works, or if you have your own SARS-CoV-2 annotation data or genome data; we're happy to show you how that works.
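As a sketch of the PCA step described above, the principal components of a centered sample-by-variant matrix can be computed via SVD (a generic numpy illustration of the technique; SVS performs its PCA internally):

```python
import numpy as np

def top_principal_components(variant_matrix, k=3):
    """PCA via SVD of a centered sample-by-variant matrix, returning the
    sample coordinates on the first k components (for cluster plots)."""
    X = variant_matrix - variant_matrix.mean(axis=0)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]      # sample coordinates on first k PCs
```

Plotting those first three coordinates, colored by collection site or date, is what produces the kind of cluster-and-strain picture shown on the slide.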
And then, in the exciting news column, I really wanted to share that we've had a number of press mentions in this space. Our CEO, Andreas Scherer, has been featured in three different journals with detailed articles that he authored describing this type of strategy for analyzing COVID-19 genomes. So there are these three papers: "Diagnosing and Tracking COVID-19 Infections, Leveraging Next Generation Sequencing"; the Clinical OMICs paper, which covers Golden Helix in general but also spends some time discussing this topic; and the Clinical Lab Manager paper, "Leveraging Next Generation Sequencing Technologies in the Fight Against COVID-19." I also want to share that we just had a publication accepted. This was done by Andreas, of course, as well as some other folks at Golden Helix, including myself, James, and Darby, people you may have interacted with if you've contacted us on the support side, as well as a collaborator in Germany, Christina Scherer, coming from the hospital and pathogen expertise side. This paper describes what you just saw in picture form, and it has some of those figures in it: how to use SVS to do principal component analysis. Go check out the paper when it comes out; it's been accepted and will be published in the European Journal of Clinical Involved Medical Sciences. It's really quite interesting, the amount of analysis and trends that we're able to see just by using this strategy. And if you're interested in this topic, we have a fantastic e-book that covers COVID-19 in general, how next-generation sequencing can deliver significant insights, and how Golden Helix software (I've mentioned SVS a lot here, but also VarSeq) can be used when analyzing virus genomes in this space. If you look in your chat interface, you'll get a link from Delaina to download this e-book, so please go ahead and use that; I would love for you to take a look. One more time, I just want to acknowledge the support we received in the form of grant awards from the NIH. It's truly an honor to receive these multiple awards, and we appreciate the investment the NIH is providing in the production of high-quality tools for the genomics community. We have some time left, and I'd love to take a couple of questions. If we don't get to all of the questions, we will follow up with you via email, and we often put up a blog post answering these questions as well. So fire up that chat interface, and I'm going to start going through these on my screen, and Delaina can answer them as well. So we had a couple of good ones here.

 

Yes, Gabe, it looks like we have a few that came in. I'll give you a second to read them. And for anyone in the audience who's interested in asking a question, you can see on your screen where you can type it into the GoToWebinar questions pane.

 

First question: what species are supported in SVS?

 

Yeah. So, like I said, it covers a whole spectrum. I always enjoy jumping on support calls or being involved in the support process; we're always curating new species, and those requests come in through support. So if you are doing work on a species that doesn't yet have a reference genome, a gene model, etc., pulled in, we often go out and help you get that information. But I would say our bread-and-butter species are often these large animal production species. So we have a lot of things for bovine, and people in the veterinary sciences doing work with dogs, and of course plant genomics and breeding programs as well, and of course humans. So if you're doing research in agrigenomics, there's a very good chance we already have your species curated. But if you open up our software, there's a dropdown where you can see all the genomes available. There are a few more questions in there. Whenever we do a topic like this, which is very open-ended, we're going to have people interested in how it applies to their specific form of research, and of course, I'm not an expert in every one of these. There's one question about whether this type of analysis has been done in multiple sclerosis or oncology. This is a fairly new method, and in fact, I haven't double-checked how many times these papers have been cited in new publications. I did find a couple of good follow-up discussions where people are very excited about the potential for this to cover complex diseases, so it certainly could cover a lot of the different types of things they're talking about. Oncology is obviously a very different thing. We also have fantastic support for the type of precision medicine workflows you expect when you're doing things like testing tumors for the potential of their biomarkers to inform the treatment of care or prognostic predictions and things like that; for that you would want to look at our VarSeq VSClinical suite. But anyway, the overall answer here is that you, the audience, may be producing the next generation of papers by using something like SVS that has this method implemented. I think it's still a very early strategy, and I'd love to see it play out and see how well it performs. OK. One more question I can answer real quick: when is this capability available, and in what version of the software? What you just saw is essentially what we're calling our release candidate. We're doing internal testing on it this month, and we expect we'll probably have it ready next month, depending on that internal testing. But this is a completed solution; we're really just working on making sure it's well documented, and we've banged on it. It'll be coming out in the next version of SVS, and we really look forward to that announcement. You'll see it on our blog, and you'll also see it within your SVS software; of course, there's a little notification when updates are available. That's it for now; we can cover more of these in a follow-up.

 

Exactly. Yes. We will be sending out a recording of today's presentation and the slide deck. And for anyone who has a more detailed question, we will answer via email as well. But that's it for you, Gabe. I just want to thank you for giving this excellent presentation; we always love having you on here. And thanks also to everyone who joined us today.

 

All right. Thank you. It was a real pleasure.

 

Thank you. Hope everyone has a great rest of your day.