Webcast: Emerging Methods in Whole Genome CNV Association

Presenter: Christophe Lambert, PhD., Golden Helix CEO and Co-Chair of the FDA MAQC CNV Team

Date: July 23, 2008

Presentation

Well good morning, everyone.  I’m Christophe Lambert, the president and CEO of Golden Helix, and I’d like to welcome everyone to our presentation on emerging methods in whole genome CNV association.  For some of you who may have attended some of our previous seminars, if you remember last fall we gave a presentation on some of our methods for whole genome copy number association.  We put out the word to people that we were very interested collaborating on whole genome copy number association studies, and we in fact had about 20 applications from various groups saying, “Yes, we’re very interested in working with you on whole genome association.”  And so we thought, work with a number of groups, and have ongoing studies in copy number variation, as well as we’ve looked at a number of public data sets, and so all together we probably have access to about 25-some odd studies, and currently we’re in the midst of about 20 of them.  So today’s really going to be covering some of the methods and learnings that we’ve had as we’ve been doing these studies.  We’re not going to be disclosing specific details of any particular collaboration results, because obviously those need to be published and so forth, but we will be talking about some preliminary results analyzing of Wellcome Trust data that we’ve had access to through our work with the MAQC group, with the FDA, that’s looking at sources of variation in SNP and copy number association studies. 

So with that introduction, I’d like to just motivate copy number association a little bit.  It’s really interesting that it appears to be the case that there’s more variability in the whole genome as a function of copy number variation, both deletions and amplifications of small regions or large regions of DNA, and there’s really the promise that with combining both SNP information and copy number information, we can get much deeper understanding of disease.  Of course, we’re familiar with diseases like Down’s Syndrome, where there’s a trisomy of chromosome 21.  These very large regions are, might even be visible in a microscope.  However, what we’re really interested in is how do we look at thousands of samples, and find these very small regions that might be implicated in disease.  So, copy number variation association, the approach we’re taking is very analogous to whole genome SNP association studies.  You’ve got these markers, and in many of the newer arrays, both polymorphic and non-polymorphic markers, and the some arrays that have been used for making genotype calls, you can also get copy number variation information.  Albeit, at a very low resolution.  So what we’re looking at is sort of traditional statistical association.  You can think of, you’ve got a phenotype as a Y, and your X matrix is a bunch of information on copy number variation.  It might be continuous Log-Ratios, which is comparing the intensities of given SNP or CNV marker against a reference sample.  You could discretize it into 0, 1, 2, 3, 4 copy number states.  The approach we’ve been doing, mainly because the resolution of the data seems to be such that all we can really do is look at loss and gain in neutral states, but these are all different approaches you can do to construct the co-variants for association.

So, if we thought it was challenging to do a whole genome SNP association study, copy number variation association is probably an order of magnitude harder.  We’re familiar with really good call rates with SNPs, 99.9% in some cases.  At a probe level, is probably less than 80% accuracy.  You really need multiple markers in a row to make accurate copy number calls.  And we’ll show that a little bit later.  So there’s ambiguity in the copy number calls, and very little ambiguity in the SNP calls.  With SNPs, we can use sparse storage versus taking a full floating point number to start the copy number variation association.  So if the data set sizes are about an order of magnitude or more, 16 times larger, there’s not only false discovery problem with looking through multiple hypotheses across the genome, say 500,000 or a million tests.  There’s also the false discovery challenge of actually making the copy number calls, something that’s not really so much of an issue with SNPs.  What we’re seeing – and a fair amount of time on this, when you’re doing these large studies as opposed to just looking at one sample and visualizing it, there’s a very major problem with batch effects in almost all of the studies we’ve seen.  And part of that could be mitigated through better experimental design, but to a certain degree it’s much like gene expression analysis, one of the curses of this type of data processing.  And we’ve argued that it’s getting a lot better, but we’re still in the early stages of establishing methods for whole genome CNV association, and really this talk is showing our current learnings and where we’re at on really finding the associated signals within whole genome CNV studies.

So just a quick overview, some of the studies we’ve been looking at.  The Wellcome Trust data is Affymetrix 500k, there’s 7 different diseases.  Bipolar, coronary artery, Chron’s, hypertension rheumatoid arthritis, and type 1 and type 2 diabetes.  Approximately 2,000 cases in each study.  1,500 national blood service controls.  The original Wellcome Trust study used another 1,500 controls that they haven’t made public.  Also, we work with a number of collaboration partners, over a dozen now since last December.  Got about 9 different population studies ongoing, and several others in the pipeline.  Everything from Affymetrix, 500k, Affy 5.0, 6.0, Illumina 370,  1M, even combinations of some of these.  Everything from 300 samples to 4,500 samples.  That 4,500 sample, Affy 6.0 study is very interesting.  You’ve got about 42 gigabyte process matrix starting from close to a terabyte of raw cell data. 

We’re also looking at a number of family studies.  We hope to be presenting a seminar specifically on family studies in a month or two, as – and this presentation will be primarily focused on population base association study.  So all the analysis we’re going todo today in displays has been done with the SNP and variation suite, and you’ll see more of that as we go forward.  Now, the general outline of the talk is to really walk through the workflow we’ve been using for association studies.  Starting with getting your data in, doing quality assurance, correcting for batch effects, assessing these batch effects, then doing association tests on the Log-Ratios.  Looking then at segmenting the data and making copy number calls, and doing associations with data related to those copy number segments.  And as we get towards the association test methods and so on, we’ll highlight with some of the Wellcome Trust data some initial results.

So on data import, I’ve given another talk on this in the past that goes into great depth of just all the initial processing that needs to be done.  It’s not something to be taken lightly.  You really have to carefully think about it.  We’ve taken approach for the Affy-data, pretty much analogous to what they’ve been doing in their genotyping console.  One of the reasons we had to reproduce all this analysis is there are some limitations on the software that could only handle about 400 samples.  You could rig it a bit with large memory configurations to go to maybe 600, 700, but we’re looking at thousands of samples, and so we built quantile normalization approaches analogous to what Affymetrix does.  One key thing, though, that – you have to make sure when doing quantile normalization not to incorporate the X chromosome data in, or it shifts the means of males versus females of the distributions just slightly, but it’s enough that you’ll see spurious associations on gender if you were to do that.  And so, virtual array generation’s another method that’s important of how do you merge your NSP and STY channels for the 500k, or merge the copy number and SNP probes, which have to be treated slightly different in terms of how they’re normalized.  We talked a bit about normalizing the Log-Ratios against reference populations, and we got to build the infrastructure, really, to deal with four, five thousand samples, some of our largest studies, and as I mentioned earlier, you’re talking about literally a terabyte of data for some studies that you then process down to a 1.8 million by 4,500 row and column matrix containing the Log-Ratios.  And so we really had to build the infrastructure to deal with these really large memory things, and do it within the footprint of a regular windows machine, by basically taking off the shelf Dell hardware, and been able to analyze this type of data overnight.



So, a lot of people asked the question, “What’s the appropriate choice of reference?”  In some cases, in small studies, people will be say, using an external reference like the HapMap samples.  Ideally for large studies, you want to be using the studies themselves as their own reference.  Just to understand what we mean by a reference, you get an A allele and B allele intensity for each SNP.  You take a median across for a given SNP, out of the 1.8 million SNPs, say the Affy 6.0.  For a particular SNP, you get the median A and median B intensity for your reference population as your denominator, your numerator, is the A and B intensity for a specific SNP or copy number probe.  And you take the Log-Ratio, and thus if they’re equal it would be zero corresponding to, presumably copy number neutral, or at least what the median of the population is.  We’ve done some testing on several studies, and looked at using just controls as a reference, which seems theoretically like that might be more appropriate, because in the cases you’d expect copy number variation.  However, we found that you get very similar results whether you use all of the samples, or the controls only as a reference. 

This is a plot of an association test on Log-Ratios.  The correlation between the response on one axis is controlled only as reference, and the other axis is all samples with reference.  Highly correlated.  If you do the same plot of P values, especially at the low end of the highly significant P values, they’re extremely tightly correlated.  On the ones that are a little less, that are not significant and hence uninteresting, the correlation’s a little worse.  So it’s not identical, but because of some batch effect issues, which we’ll talk about later, we’re really leaning towards using all the samples.  And the reason it’s not such a big difference, a) think about that these are really large samples.  Copy number variations aren’t that common except in certain regions, and so the median is going to be skewed towards your neutral, or whatever is – recognize that neutral is not always a copy number 2.  Sometimes the most common thing in a population is copy number 3 or copy number 1.  So that becomes the reference.  So there is a little bit of a problem with, that zero necessarily means copy number neutral.  But when you take a median, whatever that most common thing is, that’s in controls as well as most of the cases as well, is going to tend to be your copy neutral reference.  So it doesn’t matter so much with a median.  With a mean it would be more problematic.

Also, as I said earlier, this is just a plot.  It comes through a little fuzzy, but if you had included X in your quantile normalization, you’d get spurious associations on gender with your Log-Ratios.  Now, if you don’t do that, there still are for some reason, probably a few, couple hundred across the whole genome, markers that still seem to be associated with gender.  We’re not sure if that’s just some SNPs that have been misclassified in terms of their position, or they have a sequence that’s somewhat in common with the X probes, or perhaps some sort of a funny match effect where genders are not really randomized on the plates.  So, just something to be concerned of in terms of quality control.

Now Illumina platform’s another one we’ve done a number of studies with.  The approach is similar in terms of taking a ratio of your intensities versus the reference population.  They use formulas for basically polar coordinates, getting a Log-Ratio.  We’ve tried using the other workflow on the Illumina data.  We didn’t have exactly comparable comparison of a full 270 Illumina HapMap samples, we didn’t have access to that, and that’s what they used, presumably, for their references.  So we’ve just been taking the actual data from the Illumina BeadStudio platform.  But a key thing to note, if you’re using the Illumina platform is, though by default the reference population they use is a HapMap, and that’s fine for a small study, but you have to recognize that the HapMap samples were done at a different time on different machines with different temperature, potentially different length of time of the mortalization of the cells versus whatever your particular study is.  So what we looked at here, this is a thousand sample study, and we used on the top one, internally normalizing against the sample and the study, versus the bottom histogram here is normalized within the HapMap.  And this is the X and Y, sort of the X intensities, a histogram, between males and females.  So you’d expect copy number 2 for females, copy number 1 for males, that there’d be a shift in the distribution.  Ignore the X-axis here, this is actually not a Log-Ratio but a different scale that we were using when we built these histograms.  But the key thing to note is that the difference between males and females is much starker when you normalize within sample.  In other words, we’ve had a full thousand samples as reference when they were all drawn from the same population.  When we use HapMap, there’s a much larger overlap, and hence that’s going to cause a greater signal to noise problem with when you normalize within the HapMap.  And so there’s a simple option to re-cluster all SNPs within the Illumina software BeadStudio, and you can get better results.



So, let’s talk a little bit more about quality assurance.  One key check we do is, we look at the mean value of the X and Y intensities for males and females, and see if they check out with what is reported gender.  So here’s a histogram of the means of the X chromosomes, and so each sample represents one observation, so there may have been, say, a thousand, or a couple thousand samples here.  And so you notice, there’d be the males with the low copy numbers for X, and the females for the high on average.  And then there’s some in this no-man zone that are potentially problematic samples.  And so we potentially drop those, as well as, you’ll see occasionally, on the next slide here, we can do both X and Y if your ray has both X and Y probes, and you’ll see here’s the males up here, and the females over here.  This was the HapMap study.  Here’s sample NA10854, where the X chromosome intensity is not over here with the other females.  It’s actually closer to males, but it’s also not centered around males.  Turns out there’s some apparent X mosaicism, where there’s 68% of the cells have copy number one, and the rest have, presumably, copy number 2.  And so this was apparently healthy female individual who had 8 children, and yet there’s something funny going on within this sample.  So we may drop some of those samples where they look particularly problematic.  Another thing on data quality is, we’ll look at the overall call rates, and different platforms have different suggested rates for SNPs that represent when they would redo the experiment.  So we typically go with those thresholds.  So that’s like your overall call rate of SNPs.  So this is just whether or not you’d keep a given sample.  If it had a 50% call rate, you’d probably drop it.  If it’s 85, 95%, or better, you’re in a good shape.  Also, another approach.  We haven’t done this ourselves, but we’ve seen some people filtering samples with high Log-Ratio variants.  So if it’s particularly noisy, you might drop it, and likely those would also correspond with problems with call rates.

Turns out, most of the time spent on these analyses, or a large part, really is making sure you’ve got your data clean, and it’s sort of the boring part, but and then you move to the exciting part of the analysis.  But if you don’t do this right up front, it’s sort of a garbage in, garbage out scenario. 

Now, what about dropping specific markers?  Now, those of you who have done whole genome SNP studies, familiar with picking SNPs with problematic Hardy-Weinberg equilibrium, low call rates, low minor allele frequencies, and dropping them from analysis.  And we argue that common CNVs can often account for bad clustering, which could lead to problems with call-rate, which could lead to funny Hardy-Weinberg equilibrium, and so forth.  So we’re not necessarily going to use SNP quality measures as a reason for dropping a marker for the purpose of CNV analysis.  We might consider dropping gender-associated markers.  Our approach, however, is generally to leave the data in, and then go afterwards, after you’ve found significant regions, and just verify that the source data is solid.  You can go look at your cluster plots of the original data within the Affy or Illumina software.  Also, I think we’ve kind of become a little over-concerned with quality.  When you kind of have a co-measurement with the fact that this data really is inherently noisy in CNV land as opposed to SNP land, and that no matter what you’re going to have to follow up the significant regions with some kind of follow-up PCR-type analysis of one sort or the other.

So, that being said, let’s talk about another real problem that we have to deal with in terms of data quality, and that’s batch effect in, as well as problems with population stratification.  The batch effects are real bugaboo in whole genome CNV studies.  They come from several sources.  One is just problems with experimental design.  I think because SNP call rates have been so excellent, we’ve kind of forgotten some principals of good experimental design in, for instance, not randomized case and controls on plates, even, say, borrowing controls from other experiments.  Not that that’s a bad thing in a snitch study, but in the CNV study it’s going to create, there’s going to be major differences between experimental conditions, you did your cases versus borrowing someone else’s controls.  A real bad thing is to split families up on different plates.  So we don’t talk much about family-based studies, but if you are doing a new study, make sure your trios or your extended pedigrees are on the same plate, and that way they’ll be as comparable as possible.  But then there’s sources – so there’s sources of variation that account for these batch effects, where there’s – even at the same site, as you do it over time, things can change.  Of course, it’s often that they’ll be multi-center experiments, since then each site will have its own specific batch effects, and if you didn’t carefully randomize case and controls across these sites, you’re going to see spurious associations based on those differences.

Another thing worth seeing that we believe is going on is non-linear sequence-dependent temperature variation.  So if you do the same, it’s the same sample, at two different temperatures, the intensities don’t just vary in a linear manner.  So that’s one of the real challenges in batch effect correction, is it’s not just some constant factor.  It’s a moving target as a function of the probe sequence.  So, of course cell types, and there could be differences between, say, the lymphocytes, where you’re getting all the blood, versus, say, buccal cells from cheek swabs.  As well as, there could be real differences between immortalized cells and just cells that have been extracted right from the tissue and run because of, it really appears copy number variations tend to accumulate as cells divide and reproduce.

So here’s some recommendations.  You’re doing case control study, randomize.  Put an equal number of cases and controls on each plate.  So if you’ve got a 96-well plate, you should have 48 cases, 48 controls.  Keep families together, and ideally do all your samples at the same site over as quick a period of time as you can without changing the dials on your machines.  As well as, nobody we know has done this, but we think this would be extremely useful for fixing batch effects, is if you’ve got 96 wells, if you could spare a couple on each plate to put the same samples on every plate, a very well characterized male and female sample.  Possibly a HapMap sample, or perhaps one of your samples that you know is not going to have differences in terms of sample type, cell type, and mortalization procedures.  And then that would enable you to both have some reference for the X and Y, for the males and females, as well as look from plate to plate and see the differences on the same sample.  So we might have a shot at fixing the sequence-dependent temperature variation.

How do you know if you have batch effects?  If you take your Log-Ratios and do an association test across the genome, what you’ll see is peaks like this.  This is in a log scale, 10 the minus two hundredth, 10 to the minus fiftieth, all across the genome.  And we see this in Affy, we see this in Illumina no one’s immune from it, and it’s just  a function of these batch effects. 

So how do we fix it?  What we’ve done is – once you’re familiar with the Eigenstrat approach, it’s been used in SNP studies to remove variation in allele frequencies due to population stratifications.  So you have different ethnic groups that just have different allele frequencies, and one ethnic group’s got a different prevalence of a disease, you’d end up finding alleles that differentiate the ethnic groups instead of finding the regions that are significant, that are relevant to the disease.  So, the principal it works on is basically, with different ethnic groups there’ll literally be thousands of SNPs whose allele frequencies shift as a function of that ethnicity.  And when you do a principal component analysis of the covariance matrix of that big matrix of SNP data, you’ll see differences between – show up as, across thousands of SNPs, in the first few principal components.  You can factor those out of the data, and then that removes the ethnic problem, and you can form your association test.  Well very analogous to that, the batch effects represent shifts across thousands of markers.  They show up in the first few principal components when you do the same type of analysis on the Log-Ratios.  You factor them out and you leave the small differences that are associated with the disease.  And one of the bonuses as well, is we can correct for population stratification as a function of CNV at the same time.



So here’s just a little plot of the HapMap data.  270 samples, Asian, Chinese, and Japanese is in magenta, Yoruban is in yellow, and the CEPH samples are in blue.  If you look at the first two components, component one seems to be able to divide out the different ethnic groups, but component two isn’t.  Component two is more a batch effect component.  This is actually the Log-Ratios we’re doing the PCA on.  Whereas we look at component one and three, we really split out the groups.  And the most significant components were 3 and 6 for dividing the three major ethnic groups.  And we look at Wellcome Trust, so if you’ve read the Nature paper on the large Wellcome Trust study by the Case Control Consortium, they’ve characterized quite correctly that the cases and controls are drawn from the same population.  So if you do a SNP-based Eigenstrat approach and you look at the first couple principal components, we see the case and controls heavily overlap one another.  However, if we do the same analysis on CNVs between cases and controls, we see large departures.  And of course they had a common set of 3,000 controls, which we just have 1500 of them.  And they were running a different batch than their seven different case studies.  Now we see here, not only is there batch effect differences between casing controls in the magenta and blue, also we see large clusters within the cases and within the controls that are very different positions.  So there’s even batch effects within the whole set of cases and the whole set of controls.

Question then, how many principal components do you use?  We’ve got a little procedure that we’ve found works pretty well, kind of corresponding with how you would eyeball the scree plot.  If you look at the Eigenvalues of – this was 3,500 samples, so there’s 3,499 Eigenvalues.  There’s a few very large ones, and it falls off very quickly.  It’s kind of hard to see unless you take the log of it.  So if you take the log of it, we found that that seems to be a better approach finding the elbow in the log of the Eigenvalue plot.  And we use a method by Zhu et al in this paper in computational statistics and data analysis, and have a simple way to implement it in our software, where it basically will in somewhat of an objective way pick that elbow.  And surprisingly we find with, and I think it’s due to these temperature specific shifts that are sequence-dependent, that we really have to use quite a few Eigenvalues.  In Eigenstrat with SNPs, if you have 3 populations, you might use 3 or 4 Eigenvalues.  We’re using 50, 100 Eigenvalues depending on the sample size.  So here, this was for the Wellcome Trust.  I think 112 was the optimal number.  And if we look also to get a sense of what’s the right number of Eigenvalues, if you do an association test on the Log-Ratios versus your case control status, here it is with no Eigenstrat correction, 10 to the minus two hundredth P values everywhere, basically.  It’s finding these big batch differences.  If we go to 25 Eigenvalues, we still see that the baseline noise is still at 10 to the minus fortieth.  And here it’s hard to see, but it’s closer still to 10 to the minus tenth at 50.  And sort of – and at about 100 Eigenvalues, it really kind of knocks down.  Now there’s still a lot of peaks here, and an approach we’ve used – I don’t really have a slide on it, but it’s basically, if you do a median smooth of the P values – we typically do a 5 point median smooth.  So you take the given P value and the points, two points on either side of it.  You sort them and take the middle one.  Basically it will knock down any peaks that aren’t represented by at least three out of five markers being highly significant.  And given the noise of this data, although we may believe that a peak that’s only a single SNP, I’m much more comfortable going with the ones that have multiple markers.  And we’ll show a little bit more of this when we get into the association testing.

So the point is, kind of empirically we’ve just looked at, there seems to be on a log scale, pick the elbow and that seems to be an appropriate number of Eigenvalues to do your batch effect correction on.  And nicely, we see some very large signals showing up now that the noise has been suppressed.  And so the good news is, if we’ve got real batch effect problems, there is a way to mitigate it.  On the other hand, if you’re doing a new experiment, hopefully some of the suggestions we have on randomization and experimental design will help you get even better quality data.

A real bonus –even people who aren’t interested in association testing, but are wanting to look at specific samples, if you do the batch effect correction it will reduce the noise, and you get a better signal to noise ratio when you’re going and doing segmentation and so forth.  So this was a HapMap sample in chromosome 22.  Actually, all different platforms, you do see this LOE region at the end of this chromosome 22, for this sample, and it comes out much more clearly with – you see a correction.

Now let’s get to the exciting part, which is doing association testing.  We’ve done some infrastructure for doing trend test, T-test, regression test, either on a binary or quantitative tree.  Basically, these big matrices, we’ve got to have them on discs, so we select a file right now and you can do association across the whole genome.  And then there’s some options for a principal component analysis.  Also, you can export the results of the PCAs for doing further segmentation and analysis.  But typically the result is you’re going to get a set of P values that you can then plot the associations against, against the whole genome.  So here we’ve done the 7 Wellcome Trust studies, and this is a plot of the P values across the genome.  We haven’t really shown the axis here, it’s just chromosome 1 up through 22.  I didn’t do X in this particular study.  And so well talk a little about this later when we go over the Wellcome Trusts in some more depth.  But the interesting thing is there really are significant signals, and we’ll talk a little bit about how they seem to hold up in terms of comparing the literature.

So, another approach besides taking the associations just on Log-Ratios is to actually make copy number calls on the data.  Now, there’s a whole body of literature on different methods for finding the copy number regions.  If you think about it, the challenge is really, you’ve got this comparisons against a reference where if it’s copy neutral, the Log-Ratio should be zero, and if it’s a deletion it should be less than zero, and an amplification should be greater than zero.  If you just do an average within these two regions, you get something like a copy number.  The problem is, for a single probe, you see this little dot here.  Does that represent a very small deletion?  Or is it just the noise of the data?  And so the real challenge is, how do you tease that out?  And there’s a number of methods that have been put forward.  Hidden Markov models were one of the earliest methods, and so they’ve kind of gotten some favor, but in this comparison papers by Willenbrock and Lai, they really found that the Hidden Markov models didn’t perform at all compared to some of the more powerful methods like the circular binary segmenting algorithm. 

We’ve implemented our own approach, which is based on some published work that actually dates back to the early ‘70s that really solves optimally this problem of how do you find a set of break points that minimizes the sum of squared deviations from the means?  The idea is you’ve got a mean with noise, plus minus noise within a given segment, and you’ve got to find the segments that best fit the data.  It seems like an insoluble combinatorial optimization problem, but through clever dynamic programming you can look through all possible segments and find the optimal set of break points.  And so we’ve implemented this approach, and it exceeds the performance of circular binary segmenting slightly in both sensitivity and false discovery rate.  We’ve given some other presentations on this approach, but the point is, pick a good method – we’ve been using our approach, circular binary segmenting is good.  The old Hidden Markov models that first came out that really are not very competitive with the more modern methods. 



Just to kind of again picture this, this is just a cartoon that we did some simulated data here.  There’s a univariate approach, where you look at finding the break points one sample at a time.  And this is something that, most methods do this.  They just find the break points for one sample then go to the next and go to the next.  It’s good for finding individual variations.  The problem is, because of the – you can only detect copy number variations so small.  It’s probably something like 10 to 30 markers is kind of the range of the smallest CNV regions that can be found with most methods, based on the signal to noise of, say the Affy and Illumina platforms.  Sort of depends a bit on the platform.  We’ve implemented really quite a unique approach to multivariate segmenting, where the idea is if there are break points that are shared, and here not necessarily perfectly aligned, but shared by multiple samples, we would be able to find those with more power than just looking at one sample at a time.  And so we found, we’ve done some analysis on say the olfactory receptor, has quite well-conserved region that is highly variable across people.  About half of all people have some sort of a copy number variation in the olfactory receptor region.  And we found in practice with large sample sizes, with a multivariate approach you can actually detect a single sample CNV, and so here’s a histogram of the intensities between, within a single marker.  Either a SNP or CNV probe, here’s copy number 2 versus less than 2, and 2 versus greater than 2.  And so you can contrast the multivariate approach, because you have the advantage of multiple, not exactly replicates, but multiple samples.  It improves your signal to noise, as long as there’s a conservation of the region.  Which isn’t necessarily the case.  In practice, we’ve been using a lot of the univariate approach for some of our association testing, and the multivariate approach to sort of go up and follow up on certain regions, and really home in.  Is there a conserved consensus region? 

So with all this in mind, how do you turn this segmentation into some sort of covariate that you can do an association study on?  So what we’ve done with the univariate approach is you take all the break points, and then any time a new break point is introduced cross any one sample, you introduce a new covariate.  But the value that you store for the covariate is just the mean within the segment for that specific sample.  So sample 1 here, there overlaps several covariates, the ones introduced by sample 3 and sample 4, but the mean value is 1 for all of those, and then it shifts in sample 1 down to a mean of zero, and so we put zero all the way up to this point, and then minus .5 when it shifts down there.  You notice some of them aren’t nice and clean plus 1, minus 1.  There’s things slightly less than zero, slightly greater than 1 potentially.  But anyhow, that’s a set of covariates that you can actually just do an association study on.  The advantage of this is you haven’t – there are plenty of ambiguous segments that it’s not – in real data, in the noisy data that we see in reality, it’s not clear that you can say it’s either copy number one or zero or two.  And so we – you can do an association test simply on those means, and it seems to be useful approach.

If we discretize, on the other hand, and what this is a histogram of the mean values of the segments across one of the chromosomes in the Wellcome Trust data.  We do see a nice triangle distribution.  There appears to be losses, neutrals, and gains here.  If you can’t see the axis, this is zero right here, and so the thing to note however is there isn’t overlap, but the tails of these distributions.  So any kind of descretization you impose will by definition misclassify some of the segments.  So, what we’ve – it’s kind of an open question, what’s the best threshold to pick?  Should you pick where you’re absolutely confident it’s a deletion, or do you accept a certain amount of overlap of misclassified deletions and amplifications?  The approach we’ve been using so far is basically picking the valley, and picking a cutoff there, and creating a three-state model.  Alternatively, we’ve also done two-state models where you do deletion versus not a deletion, or a gain versus not a gain.  So this spreadsheet that we started with, where we use, we have the segment means, we can then discretize it into a three-state gain loss neutral model, or a two-state.  Or you might just want to know, is it a deletion versus not a deletion?  And doing an association on those.  We find they’re complimentary approaches, and we’ve been doing all three, all four scenarios. 

So, if you look at the Log-Ratios versus – and so on the X-axis here is the Log-Ratio intensities, the Y-axis is the case control status with a moving average.  We’ve seen an interesting phenomenon in many cases, this is sort of a sweet spot phenomena where sometimes there’s a sort of a range of copy number variation that that has the highest incidence of cases in this case.  In other cases, there’s a sweet spot here, probably around copy neutral, where you have the least incidence of cases, so it’s more healthy individuals.  And then on either side, either gain or loss can produce either, in one case a greater chance of being a case, or in another a greater chance of being control.  So it seems to go both ways.  Sometimes you’ll see a linear relationship, more of an additive model, but it’s really quite common to see this.  And there is some support for this in some publications.  For instance, there’s a nice presentation in a recent science webinar, Charles Lee was talking about there was a particular susceptibility to Chron’s disease, psoriasis, where copy number four was kind of the best case scenario, and then on either side it got worse for you.  Might be vice versa, but, so first we’re kind of concerned with this, but it seems simple additive models aren’t necessarily always the best way to do the association study. 

So, let’s now go a little bit more into some real data, and show you a bit about what we’ve been doing the Wellcome Trust findings.  We’re still trying to get this stuff written up and so we’re only going to reveal a little bit of the associations, as well as we’re going to be doing a more detailed talk in the fall at IGES.  But again, looking at this plot here, here’s these association tests across each of seven different diseases.  Notice here, right along this line, all the diseases except hypertension have quite a significant peak.  This is in chromosome 7.  There’s also a secondary peak in chromosome 7 that shows up in 6 out of 7 diseases, as well as in chromosome 14 here, there’s a peak that shows up highly significant in 6 out of 7 diseases.  Now, there’s also plenty of other regions that are unique just to specific diseases, and when we look up these regions, either the genes they’re in or close to, about 30 to 40% of them had previous publications in terms of whole genome SNP studies or candidate SNP studies that had found significant regions.  So for instance, there was a region here in hypertension that was associated with pregnancy-induced hypertension.  It was the only signal we saw, and it was actually, there was support for it in previous studies.  So there is some nice evidence.  These are just Log-Ratio association tests, but there’s some nice evidence that these things are holding up in other studies. 

Now, these three regions, however, that I highlighted, that show up in 6 out of 7 Wellcome Trust diseases, I really want to highlight this to all of you that are going to be doing studies, that we’re not only seeing these in the Wellcome Trust studies.  We’ve seen these in the majority of the studies we’ve been looking at.  One is in chromosome 7, this TARP gene.  Another’s in chromosome 7q34, and another in chromosome 14q112.  They’re all related to T cell receptors, and the thing is, are these – we jokingly call them Job’s genes.  Remember that fellow in the Bible who had all these plagues and illnesses on him?  Are these genes for all disease?  Well, it seems kind of unlikely.  Here’s a plot of this chromosome 7 region, it’s really the same region time after time.  In 6 out of 7 diseases.  Here’s in chromosome 14, near this defender against cell death gene.  What is it?  Is it some sort of artifact, or are these the genes for all disease, or is it something else? 

So, there is some studies they’re not showing up in, interestingly.  There’s – by the way, it’s both Affy and Illumina studies were seeing this.  Christoph Lange and others published a paper, I guess the lead author was Iuliana Ionita-Laza, where they were doing a family-based study and they – oh I see, they found it in one – sorry, there’s two bullets that should be here.  So this genetic epidemiology paper found a significant region, and it was in childhood asthma, and then there was another study that one of our collaborators have done, where we’ve been able to confirm in Illumina that this chromosome 14 peak was very strong on one of these Wellcome Trust diseases.  So we’ve also been looking at a couple of infectious disease studies where we’re seeing these associated regions, and many of the other studies we’ve been looking at.  We’ve looked at comparing two control populations between two Affy 6.0 studies.  We did not see those regions.  We’re wondering at this point if it’s an age effect, where there might be an accumulation of some somatic mutations and hotspots.  So what happens is as your cells divide, there could be regions that are susceptible to copy number variation, so the older you get the greater chance that you’re going to have copy number variations, and so is there just enough of a difference between case and controls – say, particularly with something like type 1 diabetes, where it’s childhood diabetes, where there’s a difference.



Or, alternatively, is it something where the cases in many diseases are experiencing some sort of non-specific stress or immune response?  Almost all the studies, you’ve got to remember, they’re blood, and we’re getting lymphocytes, the white blood cells, and these are – there’s T cells, there’s B cells, could there be differences in the proportions between case and controls?  Word remains to be seen, and we’ll have to do some more work to really figure out what this is.  So just be aware if you’re doing your own study and you find chromosome 7 or 14, these two regions, these three regions rather, don’t rush to publish necessarily that you’ve found something unique to your disease.

Let’s do a little comparison on type 1 diabetes.  Here’s the SNP association we did on just the 1,500 controls, so this is going to differ from the Wellcome Trust study, because I had 3,000 controls.  Here, just to make an apples to apples comparison, we didn’t do any quality control on SNPs, so you see a rather large number of significant peaks.  If you do a median smooth on the SNPs, you do get these regions that have multiple markers within a gene or within a region significant.  Similarly, we compare this against the copy number variation association on Log-Ratios.  I was hoping to show the copy number variation association today on the whole genome for the discretized data, but a power failure prevented that from happening.  We had trees knocked down next to our office here.  So, but the interesting thing is, for the most part, although there may be some regions in common, there’s different associations with CNV then there are with SNPs.  And so there’s – it’s an interesting thing that we can really find more potential reasons to explain these diseases.  They compliment one another by doing both SNP association and CNV association.  The good news too is for these diseases like type 1 diabetes, these regions in chromosome 7 and 14 are not all the signals we’re finding.  There’s some very significant ones in some other chromosomes that are actually, some of them corresponding with previously published results on SNP studies.

Let’s just zoom in on one interesting region.  This is chromosome 22 of type 1 diabetes.  Here’s the log-R associations with no smoothing.  Here we discretize the Log-Ratios using the approach we described earlier, with a loss, neutral, gain, and did associations.  We didn’t do any smoothing, but really knocked down all the spurious associations, and you really see a nice, clean peak.  By the way, in some studies we’ve been looking at, not of course in this chromosome of type 1 diabetes, but in other chromosomes as well as other studies we’ve been looking at, we do find that these are complimentary.  We’ve been seeing that the discretization does give you extra regions that you might not have seen just doing Log-Ratio associations.  So they’re complimentary approaches.  The region in chromosome 22, GGTL4, it’s amino acid metabolism, which is something that seems to potentially be related to type 1 diabetes. 

We’re still concerned that this might be one of these funny things with chromosomal rearrangements.  Remember your cases and controls are different ages, so we’re going to have to look into this a little further.  But when you use a continuous covariate, and just find a sort of optimal breakpoint to have differences between the means of the cases and controls at different Log-Ratio values.  We found, for the Log-Ratios, less than minus .139.  There was 278 cases versus 21 controls, and that was this 10 to the minus fiftieth P value.  We discretized, we actually picked a cutoff of plus minus .1.  We had 600 cases that had computer deletions, and 200 controls.  So quite a difference.  And so one other approach, just kind of to wrap things up, we’ve been using, this is just sort of very preliminary work. 

So what if we take type 1 diabetes or any other disease in the Wellcome Trust study, and compare it against all the other diseases?  Now, we know they’re not going to be perfect controls, but we essentially can get, in addition to our 1,500 controls, another 1,200, sorry, 12,000 non-type 1 diabetes cases for other diseases.  And so the first one here is just the original case control study on CNV, and here it is when you compare against your 13,500 sort of controls.  And we see a lot of the same regions are still significant, but then there’s additional peaks that show up that could be of interest.  And so, also this showcases that we actually can handle a study of 15,500 samples.  We quantile normalized all of them simultaneously in one batch.  We did all the normalization and so on.  So you have the advantage of really being able to sample even rather rare alleles when you’re doing all your pre-processing.  So that’s kind of a lot of the results we’ve had to date.  There’s some very interesting stuff we’re doing also in family-based studies, finding de-novo mutations. 

I hope the take-home message is, there is interesting signal to be found in existing whole genome SNP studies by looking at CNV.  We’ve got a lot of them ongoing, and anticipate getting them wrapped up in the next few months.  We’ll be presenting more at IGES.  I’ll be speaking at a conference, I got a 15 minute slot to try and cover some of the highlights, and we’ll be talking about some of the specific genes and regions we’ve found in these studies, and as I said before, we’ll be talking later some about what we’ve been doing with family-based studies.  There’s some distinct differences.  We’ve also built, Christoph Lange from Harvard has built into PBAT the capability of doing all the family-based association tests of PBAT, including time to onset multivariate outcomes and so forth within the context of copy number analysis.

So we will in the future be very interested in working with both collaboration/consulting opportunities in additional studies, for those of you who still would like to share in and gain from our experience in doing many of these studies, as well as get some help doing this analysis.  And so we’d like to invite you to come check out our website.  We’ve got a number of additional webcasts that talk about different aspects of CNV and SNP analysis, and there’s various special scripts that can do different processing of CNV data to do.  Almost daily we’re producing some new analytical method to help us.  It’s really forcing us to do the best possible analysis methods by essentially eating our own dog food, using our own software to do this analysis, and proving it over time.

So with that, I’d like to thank you all for joining us, and like to open it up for questions.



So, someone was asking, talked about Log-Ratio throughout, and not B allele frequencies.  B allele frequency, not too important detecting and analyzing CNVs in the SNP study.  So the thing to recognize is that a Log-Ratio really does incorporate both the A and B channels, and so there’s a ratio, but there’s also the B allele frequencies is essentially embedded in that.  We tried to do some stuff with B allele frequency to see if we could really get more out of the data, but it didn’t seem to be a very profitable approach.  That’s not to say that others might not be able to be successful with that.  So we’ve generally focused on the Log-Ratios for the association studies.

So, could you comment on CNV QT association?  Quantitative trait, we’ve mostly been doing case control studies, but essentially the quantitative trait analysis goes in much a similar fashion.  Typically you’re looking at either Log-Ratios or discretized values, and doing a trend test or a linear regression or a logistic regression.  Well, I guess not logistic if you discretize your quantitative trait.  But essentially we’ve been doing just a little bit of that, and it seems to work well.  Now, one of the questions that comes up too on quantitative traits is, remember when we talked about randomizing casing and controls on plates, that’s not necessarily so applicable, though there may be implied in your quantitative trait some higher and lower values of quantitative traits.  So think about randomizing those so you’re not skewing your analyses. 

So questions, why are autosomal markers associated with gender, and is it essential to delete them?  Yeah, I guess I said in the talk, we’re not exactly sure why, whether it’s a sort of similarity in sequence between some sex markers, or whether it’s some sort of a batch effect, or that these really are sex markers, that our marker map information is just incomplete or incorrect.  Is it essential to delete them?  No.  We’ve been using them, and then you just want to carefully look at the results after analysis to make sure they are what you think they are.

So someone was saying they used the test version of our software with 270 families, sib-pairs and parents.  How many principals should I remove?  3, 10, 50, or 100?  In family based analysis, again we’ll be trying to do a presentation in the future on this.  We’ve been a little more careful, specifically with extended families, in not removing too many Eigenvalues, because there could be, obviously, this correlation between family members that might start showing up in your larger Eigenvalues.  So we’ve often been going a little lighter on the Eigenvalue correction.  And so the one way to kind of get it, what the right number of Eigenvalues is, is you kind of iterate, look at your association tests.  If you’re just seeing a set of noise across the whole genome where the batch effects are dominating, it’s usually useful to go a little higher on your Eigenvalues.  So you can kind of iterate on that.  Hope that helps.  By all means you could talk to us some more about that.

Let’s see.  So, what variables should you get from a Illumina filing report to enable all the analysis you described?  I’m glad you asked that.  We have a plug-in that you can download from our website that will export what you need directly from BeadStudio into a binary form that’s ready to go.  There’s also the more cumbersome approach, where you could export as a text format and translate it into a CNT format.  This is actually an Affy format, which was a way you could get in non-Affy platform data, and so some people have looked at Nimblegen and Agilent arrays with our analysis as well.  So basically, there should be a tutorial on this somewhere on our website, or at least some help in the manual about this plug-in.

So there’s a question, any comment on GC-rich SNP probes showing more intensity than the less GC-rich SNPs?  Yes.  In fact, they can differ by a factor of 2 or 3 sometimes.  The interesting thing though is we spent a lot of time looking at GC content, and we even did modeling on the sequence itself, we did very complex predictive models, and we found that GC content’s really only a low order modifier of the binding.  That the entire sequence has a lot to say.  What we found was, we can really predict this variation based on the sequence, including how many Gs and Cs there are inside the probe sequence.  But what happens is, when you do a comparison against a reference population with the Log-Ratios, what happens is, both for your specific SNP, the GC content for the reference and the SNP are the same, and so basically the GC content difference disappears in that division.  So essentially all the stuff we were doing on predictive modeling was just trying to predict something that you had the data for, and could factor out just by simply doing this approach.  So we’re actually, I made the, maybe the controversial statement that I don’t really see a benefit to try to use GC content when you’re doing these whole genome studies with large numbers of references.  Now if you had a very small number of references and you’re trying to do a study, I can see how some modeling on some GC content and the probe sequence would help.

Next question.  Could you comment on the effect usefulness of pooling in case control studies?  We have not really worked with pooling studies.  We’ve talked to some people who have been doing it.  I guess I don’t have enough experience to really talk about it.  I guess the usefulness is it costs less to pool things, and potentially you could just do two experiments, or some replicants of your pool samples, and do a comparison of those two, of your pool set of controls, pool set of cases.  The thing we’re seeing in the real data is, the regions of copy numbered variation really differ, like the length of them from one sample to the next, even in a region.  So that can kind of be captured when you do each sample separately, and in a pooling thing you’d essentially be doing, just stacking them all on top of one another.  So just something to think about as you’re doing those. 

Do you correct for GC content?  I guess I talked a bit about that before, that essentially you are correcting for GC content when you do that comparison of the samples against the reference.

What is the appropriate parameter for min markers per segment for Affy 6.0 CNV analysis?  Good question.  So what we’ve been using is 10 as a default, and that seems to work pretty well for us, knowing that what tends to happen is there will be more false positives at that low range.  What this person is asking about, is what’s the smallest region to CNV that you would look for, and this would be in the univariate analysis.  But multivariate – and by the way, we also recommend, we do a window size of 5,000 markers, and typically look for up to 20 segments within that window size of 5,000.  Seems like a good balance between compute time and being able to really cover the chromosomes.  Why we do a window size of 5,000 rather than the whole chromosome is it provides a consistent length that you’re finding the CNV regions on, and there’s some esoteric things that I probably shouldn’t go into on the methods of why that’s the case, why it matters, but we’ve been going with about a window size of 5,000.

Another question.  Why are Log-Ratios chosen for association analysis, rather than CNV genotype assigned in previous steps?  How do reduce noise when using Log-R ratios directly.  So, I guess why we spend a lot of time on Log-R ratios, was initially we were finding the data was such that you were artificially districtizing data that, especially for the small CNV regions, certainly if you have a region of 100 markers you could probably correctly and easily discretize it.  But for 10, 15, 50 markers, or a standard 30 to 50 markers, the means are really all over the map.  And so we then said, “Well look, let’s just use the data as it is and use it as a continuous covariate.”  We found after doing a lot of segmentation that it seems to provide additional information.  So quick and dirty analysis, Log-Ratios are nice.  How to reduce noise is to, we really are reducing the noise by taking the means of the segments, as well as when we go to descretization, that’s really the way to reduce the noise, knowing that we tend to misclassify some of the segments.

And that being said, I’m going to go back to my original statement.  This data really is quite noisy, and so what we’re really trying to find is find some significant regions, follow it up with some additional higher resolution analysis, and then write up our papers.  So, anyhow, I think that’s all the questions we’ve had for today.  I’d like to thank you all for attending.  We’ll be making this presentation available in the future on our website, if you’d like to review it again or have some colleagues watch it.  And again, please feel free to contact us, and if you’d like to work together on doing analysis, we’re at the place now where we’re really in a mode of needing to charge for this type of analysis, but we’d love to keep it reasonable, as well as, we’ll help you catch the first fish, and then you’ll be able to catch all the next fish yourself, and do further analyses.  So hopefully this has moved you in the direction of at least thinking, this is doable.  The tools are available, and we encourage you, if you have a whole genome study that’s been sitting on the shelf, that you maybe did the SNP association or thinking about CNV, doesn’t hurt to give it a shot and do the whole genome copy number variation analysis.  So thanks again for attending, and we’ll look forward to having you visit us for a future webinar.  Take care. 



© 2012 Golden Helix, Inc     Facebook     Twitter     Linked In     Blog   YouTube

Site Map   |   Privacy Policy   |   Contact Us