Welcome to the SVS Microarray CNV Quality Assurance Tutorial!
Updated: January 28, 2021
Packages: CNV Analysis, Power Seat
This tutorial covers a typical workflow for quality assurance procedures for whole genome CNV analysis based on microarray data. It also includes some discussion of importing and processing raw data files prior to beginning a CNV analysis project. The tutorial is built around the Affymetrix 500K array, but the workflows are generally applicable to most genotyping arrays and aCGH platforms. There are some anomalies with Illumina data where certain analysis steps may not directly apply. For more information on what to watch for with Illumina data, see CNV Analysis Tips for Illumina Data.
To complete this tutorial you will need to download and unzip the following file, which includes several datasets.
File included in the above ZIP file:
- Sample CNV Data Project file – contains Matching, LogR (samples columnwise and rowwise, explained further in tutorial) and Phenotype spreadsheets.
We hope you enjoy the experience and look forward to your feedback.
Copy number analysis is a broad term that encompasses many aspects and workflows for interrogating CNVs. Applications include CNV genome-wide association studies (GWAS), cytogenetics, de novo copy number detection between parents and offspring, or “paired analysis”, such as comparing normal cell DNA to tumor cell DNA. Each application has its own set of workflows and data types to work with, but there are many processes that overlap among them. The general workflow is as follows:
- Normalize raw data and generate log ratios. Array manufacturers generate raw intensities for each probe on an array. Many factors can influence the distribution of raw intensity values generated from each chip, and normalizing the raw values is necessary before making comparisons between subjects. The measurement that is most commonly used to determine copy number status is the “log ratio” (LR), also called “log-2 ratio” or “log-R ratio” (“R” is commonly used as a variable to represent probe intensity). The standard formula for LR calculation is log2(observed intensity/reference intensity). LR calculation requires using a reference panel to determine the “normal” or baseline intensity expected at each marker. Choosing a proper reference panel is critical as it can affect all subsequent analyses.
- Perform quality assurance. With copy number data, we have found that significant bias can be introduced from differences between experiments – plate-to-plate, machine -to-machine, and site-to-site – collectively referred to as batch effects. Additional sources of variation include sample extraction and preparation procedures, cell types, temperature fluctuations and even ozone levels within the lab. Beyond batch effects and experimental variability, the signal-to-noise properties of each sample affect the accuracy of CNV detection. Batch effects and noisy data confound results, with complications ranging from poorly defined copy number segments to false-positive and non-replicable association findings. Therefore, there are a number of quality assurance protocols one should consider to either filter problematic samples entirely or correct the LR data before performing further analyses.
- Detect copy number regions. The objective in this step is to determine regions in the genome where a given sample’s mean LR value differs from the reference. These regions are referred to as copy number variants (CNVs). A mean LR around zero indicates that a sample has the same number of copies as the reference. A LR segment mean above zero typically means there is a copy number gain, and a LR segment mean below zero means there is a copy number loss. The higher or lower the LR segment mean, the more or less copies there are respectively.There are numerous algorithms available to detect copy number regions. SVS employs two optimal segmentation algorithms. 1) The univariate method, which considers only one sample at a time, is ideal for detecting rare and/or large CNVs. 2) The multivariate method, which simultaneously considers all subjects, is ideal for detecting small, common CNVs. Some CNV detection algorithms also assign actual copy number calls (0,1,2,3…) for a segment based on thresholds of LR segment means, though there are some reasons why this is problematic (see CNV Analysis Tips for Illumina Data).
- Analyze the copy number variations. Here you look at the LR segment means (or copy number calls) to find differences between two or more groups (case/control, tumor/normal, parent/offspring, etc.). This can be done visually using advanced plotting techniques, or statistically using simple summary statistics or advanced association and regression analysis. We advocate a combination of all to ensure copy number differences are indeed real and not just data processing anomalies.
- Make sense of the findings. This can be done in a number of ways, (looking at genomic annotations, pathway analysis, gene ontologies, etc.) and in some respects, is outside the scope of this tutorial, though we will touch on a few sense-making techniques.
Considering this generalized workflow, you may begin copy number analysis at any of the steps using SVS depending on the data you have access to. For example, if your data provider has only given you access to pre-processed and called copy number data, then you can simply import the copy number calls and begin visualizing the data or performing association analysis. If you’re an Illumina customer and have access to GenomeStudio, you will most likely be able to export the normalized LRs, with which you can export and perform quality assurance or go right into copy number detection.
Whenever possible (and depending on how comfortable you are with upstream analysis), we recommend you start with the raw intensity data as there are a number of things to consider during each step of the process that can have profound ramifications for downstream analysis. This tutorial leads you through each step listed above with a focus on CNV-based GWAS using Affymetrix microarray data, but please keep in mind that no single tutorial can cover every aspect of copy number analysis. As we have stated in our blog post:
“Let’s face it: CNV analysis is not easy. While SNP-based GWAS follows a largely standardized process, there are few conventions for CNV analysis. CNV studies typically require extensive craftsmanship to fully understand the contents of the data and explain the findings. As much as we would like to automate the process, there are always factors that require the researcher to make informed decisions or educated guesses about how to proceed.”
2. Generate Normalized Log Ratios
The following steps outline the process for generating log ratios (LRs) from Affymetrix CEL files. However, CEL files are rather large and the import process can be time intensive. Therefore, rather than provide the CEL files and have you import them, we simply outline the process here and provide the results from the import process, including: Affy 500K CNV LogR – Samples Columnwise and Affy 500K CNV LogR – Samples Rowwise. For your reference, we’ve also provided the spreadsheet used to merge the two NSP and STY sets (unique to the Affymetrix 500K array) during import.
To learn more about the quantile normalization methodology in SVS, see the Extracting Copy Number Data section of the SVS Manual.
A. Importing Affymetrix CEL Files
- From the Project Navigator select Import >Affymetrix >CEL.
This will open the Import Affymetrix CEL Files window. First, select the CEL files you want to process. For 500K data, you must select files from both the NSP and STY arrays for each sample.
- Click the Add Directory button, navigate to the appropriate folder and select it. Your dialog should look like Figure 2-1. Click Next.
The CEL files you selected will first be scanned to ensure proper file formatting. The next window will have more options for importing the files.
- Check the 500K NSP/STY Matching box and click Choose Sheet. Select the Matching Dataset – Sheet 1 spreadsheet and click OK. – Again this step is unique for the Affymetrix 500K array.
In most cases we recommend using all your samples as a pooled reference set. However, there may be times when it’s better to use a subset of your samples (e.g. controls or samples that previously passed QC filtering) or a more common reference, such as one or more HapMap populations, which may be needed when only importing a few samples. In order to use a subset of your data, you need a reference spreadsheet with a column that contains 0’s for all of the samples you wish to use as a reference in the calculation and 1’s otherwise. If you wish to use the HapMap samples you can choose the appropriate population from the HapMap pre-computed populations drop down. For this tutorial, we used all samples.
- Leave the radio button All samples selected under Reference Set.
If you have not yet downloaded the Affymetrix 500K Marker Map, the default option for Marker Map: will be Affy 500K Marker Map.dsm (Auto Download). If you have downloaded the 500K marker map and it’s saved in the appropriate directory, the software will recognize this and it will be listed. If you wish to choose a different marker map, click on Choose Marker Map and select the appropriate one.
- If your project is located on a shared network drive, you will want to check the Specify Temp Directory box and Browse to an appropriate folder on your local machine. This folder will be used to store large intermediate files.
Under Output Options there are a number of options to choose from. Each can be used in a number of ways.
- For this tutorial we checked LogR ratios – samples rowwise and LogR ratios – samples columnwise.
- Change the Dataset name base: to Affy 500K CNV.
The window should look like Figure 2-2. Click Finish.
The normalization process can take several minutes per CEL file to complete depending on your hardware. Upon completion, the spreadsheets selected during the import process will be created in your project.
3. Derivative Log Ratio Spread
Sample quality assurance (QA) is absolutely critical in copy number analysis. If not addressed early, bad samples will lead to poorly defined copy number segments resulting in false-positive and non-replicable association findings. The following section leads you through a few of the more standard quality assurance (QA) procedures to identify samples that have poor quality log ratio data (e.g. excessive noise, genomic waves, and cell line artifacts) and those whose identity is of question (gender mismatch). There are a number of other quality assurance steps you might want to consider in your own study. To learn more about copy number QA methods, see the Genome-Wide Success Part III webcast.
Derivative log ratio spread (DLRS) is a measurement of point-to-point consistency or noisiness in log ratio (LR) data. It often correlates with low SNP call rates and over/under abundance of identified copy number segments. Samples with higher values of DLRS tend to have poor signal-to-noise properties and accurate CNV detection is often difficult for these samples. You may wish to exclude samples with high DLRS from analysis.
- Open Affy 500K CNV LogR – Samples Rowwise – Sheet 1.
- Choose Numeric >CNV QA >Derivative Log Ratio Spread.
The resulting spreadsheet contains DLRS values for every sample by chromosome and then two summary columns; ‘All’ and ‘Median’ (Figure 3-1). “All” is the genome-wide DLRS value, and “Median” is the median of the by-chromosome DLRS values. The by-chromosome values will usually be very consistent throughout the autosomes. We suggest using the ”Median” column for QC purposes, as it is robust to effects from the sex chromosomes. The Y chromosome, for example, has very different DLRS properties in males and females.
- Navigate to the far right side of the spreadsheet and right-click on the Median column header. Choose Plot Histogram.
The histogram in Figure 3-2 is created. Notice the apparent outliers on the right side of the histogram. Increasing the bin count will result in a finer resolution in the histogram.
- Click on the Graph 1 node in the Graph Control Interface and move the bin counter slider to the right (Figure 3-3).
It is important to note that the distribution of DLRS can vary dramatically depending on factors such as array platform (Affy 6 is usually noisier than Affy 500k) and DNA source (saliva samples are often noisier than blood samples). In order to determine which samples to exclude, we need to calculate outlier thresholds. A standard approach for outlier determination is to calculate the inter-quartile range (IQR) of the distribution and then set the outlier threshold to 1.5 IQRs from the third quartile.
- Open the Derivative Log Ratio Spread spreadsheet and choose Numeric >Statistics (per Column). Leave the default options and click OK.
You’ll get a spreadsheet with several summary statistics as well as upper and lower outlier thresholds based on the 1.5 multiplier for every numeric column in the previous spreadsheet (Figure 3-4).
We’re interested in the last row (Median) and the column Upper Outlier Threshold (8), which in this case is 0.3560 (Figure 3-4). Any sample with a median DLRS value above 0.3560 is a candidate for exclusion. You can see which samples these are by going back to the Derivative Log Ratio Spread spreadsheet and sorting on the Median column.
- Open the Derivative Log Ratio Spread spreadsheet, right-click on the Median column header (25), and choose Sort Descending.
The first seven samples have a median DLRS above 0.3560 though two could be considered borderline. For this tutorial, you will exclude the top five samples with the largest DLRS medians. Applying the same procedure to a different dataset will require the researcher’s judgment. Before excluding these samples, compare visually the sample with the highest DLRS (S198) and that with the lowest (S74).
- Open Affy 500 CNV LogR – Samples Columnwise
- Choose GenomeBrowse > Numeric Value Plot.
- Check the boxes for S74 and S198 and click Plot.
Your plot will look like Figure 3-5, which shows quite a dramatic difference in the spread of log ratio values between the two samples. The majority of the log ratio values for S198 range from about 2 to -2, and can be considered twice as “noisy” as S74, a sample where the majority of values range from -1 to 1.
Now exclude the five outlier samples. Open the sorted DLRS spreadsheet, Derivative Log Ratio Spread – Sheet 2. Click on the row label S198 to inactivate the row. Hold down the shift key and click on the row label S141. This will inactivate the first five rows (Figure 3-6).
The phenotypic spreadsheet will be used to keep track of all excluded samples. After each quality assurance procedure, excluded samples will be inactivated in the phenotypic spreadsheet and a subset will be created. This method will preserve the previously excluded samples set while excluding additional samples found.
- From the Derivative Log Ratio Spread – Sheet 2, choose Select > Apply Current Selection to Second Spreadsheet
- Apply filtered rows to the following spreadsheet: Sim_Pheno Dataset – Sheet 1, then click OK twice.
Rows 141, 158, 176, 179, and 198 should now be inactive. Notice in the upper-right portion of the spreadsheet window that there are 195 active rows out of 200. Next, create a subset of only the active rows.
- Choose Select >Subset Active Data.
- Rename this spreadsheet in the Project Navigator to Sim_Pheno – DLRS Subset.
- Close all open windows except the Project Navigator by selecting Window > Close All before continuing.
4. Wave Detection
The next quality assurance step involves detecting samples with genomic waves in their log ratio data. Genomic waves are phenomena whereby, despite normalization, the log ratio data appear to have a long-range wave pattern when plotted in genomic space. Waviness is hypothesized to be correlated with the GC content of the probes themselves in addition to the GC content of the region around the probes. This wave pattern wreaks havoc on copy number detection algorithms. SVS employs a wave correction algorithm (Diskin, et. al., 2008) but there are some cases when this can be more problematic than beneficial. Therefore, the approach used in this tutorial will simply remove samples with extreme wave factors.
- Open Affy 500K CNV LogR – Samples Rowwise – Sheet 1. Choose Numeric > CNV QA > Wave Detection/Correction.
- Leave the default parameters and click Run.
- The Wave Detection/Correction algorithm requires a reference file containing GC content information. This file will be automatically downloaded.
NOTE: If there is not a GC digest file available for your species, email email@example.com and we can create the necessary file.
Now plot a histogram of the Abs Wave Factors column.
- Right-click on the Abs Wave Factor column header (1) and choose Plot Histogram.
The histogram looks like Figure 4-2. (Again, you can adjust the bin count of the graph to create a finer resolution).
Again, there are outliers visible in this histogram that should be considered for exclusion. The authors of the wave detection method (Diskin, et. al., 2008) suggest a cutoff of 0.05, but we have found that the optimal cutoff point varies across studies and platforms. The tutorial will use the outlier thresholds to determine which samples warrant a second look.
- From the Wave Factors spreadsheet, choose Numeric >Statistics (per Column).
- Leave the default options and click OK.
The upper outlier threshold (8th Column) for Abs Wave Factor in this case is roughly 0.037 (Figure 4-3).
Now sort the Wave Factors spreadsheet to determine which samples have an Abs Wave Factor above 0.037.
- From the Wave Factors spreadsheet, right-click on the Abs Wave Factors column header (1) and choose Sort Descending.
There are five samples that are considered outliers according to the IQR multiplier (Figure 4-4). After visualizing the wave effect, you will inactivate these samples using the same method as before.
Look at S130’s log ratios to visualize the wave effect.
- Open the log ratio plot we created earlier: Plot of Numeric Values from Affy 500K CNV LogR – Samples Columnwise – Sheet 1.
- Click File >Add to open the Add Data Sources window, click Project and select the Affy 500K CNV LogR – Samples Columnwise – Sheet 1 spreadsheet, then check S130 and click Plot & Close.
Your plot should look like Figure 4-5 with sample 130 plotted at the top.
It’s hard to see the wave effect by looking at the entire genome but it becomes more apparent when you apply a median smooth to the data.
- Click on the S130 graph item under the S130 graph node in the Plot Tree.
- Under the Display tab of the Controls dialog, click the Smoothing drop down menu and select Median Symmetric and change the value to 10.
The plot looks like Figure 4-6 and you can begin to see the wave effect.
The effect becomes even more apparent when you zoom in.
- Select 2 from the chromosome selector drop down on the top of the window.
- Then on the Display tab of the Controls dialog, click Manual under Y-Range, enter -0.8 – 0.8 for the Y-Range, and press the enter key.
Now compare this to S74, which was deemed pretty clean earlier based on DLRS, by applying the same median smooth.
- Click on the S74 graph item under the S74 graph node in the Plot Tree
- Under the Display tab of the Controls dialog, click the Smoothing drop down menu and select Median Symmetric and change the value to 10.
- Then on the Layout tab of the Controls dialog, click the Manual y-axis zoom and enter -0.4 – 0.4 for the Y-Range.
- Hide sample S198 by clicking its appropriate graph node in the Plot Tree.
Your plot should look like Figure 4-7.
Now filter the outlier samples based on the wave factor. As previously described, use the rows in the Wave Factors spreadsheet to inactivate rows in the phenotype spreadsheet.
- Open the Wave Factors – Sheet 2 spreadsheet and left-click once on the first five row labels to inactivate them.
- Choose Select > Apply Current Selection to Second Spreadsheet.
- Apply filtered Rows to the following spreadsheet Sim_Pheno – DLRS Subset and click OK. Click OK again to finish.
Rows 46, 130, 135, and 163 should now be inactive (sample 158 was excluded previously due to DLRS). Create another active subset with the selected rows.
- Choose Select >Subset Active Data.
- Rename this spreadsheet in the Project Navigator to Sim_Pheno – DLRS/Wave Subset.
- Close all open windows except the Project Navigator before continuing.
5. Gender Screening
There are a few ways to verify whether a sample’s reported gender is consistent with the inferred gender. In SNP analysis, we can analyze X chromosome heterozygosity using genotype calls as discussed in the SNP GWAS Tutorial. In CNV analysis, we can examine the average observed intensity of the sex chromosomes to infer gender and to identify abnormalities of the sex chromosomes (i.e. XXY, XYY). If you are analyzing a dataset for which Y chromosome data is available, it can be useful to plot the chromosome X and Y intensities together in a scatterplot. Viewing the data in two dimensions will usually reveal distinct clusters for males and females, and combinations other than XX or XY will appear as outliers in the plot. Affy 500K array does not contain the Y chromosome, so for this tutorial you will investigate log ratio averages using only the X chromosome.
- Open Affy 500K CNV LogR – Samples Rowwise –Sheet 1 and choose Numeric >Statistics (per Row).
- Make sure only Mean is checked under Output Options and check Output results by chromosome. Select the Per Statistic radio button. (leave Integer Columns and Real columns selected under Include in calculations)
- Click OK.
A Statistics (per Row) – Mean spreadsheet is created with each sample’s average log ratio value across each chromosome. We’ll look at the X chromosome row mean to check gender. First join the phenotype spreadsheet, which contains reported gender, with this spreadsheet.
- Open Sim_Pheno Dataset – DLRS/Wave Subset and choose File >Join or Merge Spreadsheets.
- Select Statistics (per Row) – Mean from the spreadsheet list and click OK.
- In the Join or Merge spreadsheet window, leave the default options and click OK.
- From the resulting Sim_Pheno Dataset – DLRS/Wave Subset + Statistics (per Row) – Mean spreadsheet, right-click on the ChrX column header (25) select Plot Histogram.
There are two distinct distributions (Figure 5-1). The left one represents males and the right one females.
Now split on reported gender to identify any discrepancies.
- Click on the ChrX node in the Graph Control Interface and select the Color tab. Select the By Variable radio button and click Select Variable….
- Choose Sex and click OK. Then click the blue box for the Sex==’F’ option and change the color to yellow. Then click the green box for the Sex==’M’ option and change the color to blue.
By changing the opacity of the colors you can see if there are any samples that have a reported gender different than what’s inferred from the log ratios.
- Click on the Item tab move the opacity bar half way to the left.
In this example, there is one sample who’s reported gender is female but according to the log ratios the sample is characteristic of a male (Figure 5-2). Next you will determine which sample this is.
- Open Sim_Pheno Dataset – DLRS/Wave Subset + Statistics (per Row) – Mean – Sheet 1.
- Right-click on the ChrX column (25) and choose Sort Ascending. This puts the Males on top and the Females on the bottom.
- Scroll down until you find the F in the middle of all the Ms (row 31) in the Sex column (2). As shown in Figure 5-3, the sample with the incorrectly specified gender is S5.
Now exclude this sample from the phenotype dataset. In practice you may be able to rectify the phenotype spreadsheet by verifying that the gender was simply a data entry error and not a genomic anomaly.
- Open Sim_Pheno Dataset – DLRS/Wave Subset and inactive the row for S5.
- Create a subset spreadsheet by choosing Select >Subset Active Data.
- Rename the subset spreadsheet in the Project Navigator to Sim_Pheno – DLRS/Wave/Gender Subset.
- Close all open windows before continuing except the Project Navigator.
6. Chromosomal Abnormality Screening
Detecting large chromosomal aberrations is both a quality assurance step and an analysis step. For example, we are extra cautious when dealing with cell line data, where we’ve seen some samples to have amplifications of several entire chromosomes. In these cases, we would remove such samples from analysis. On the other hand, large chromosomal aberrations might be real (e.g. Down’s Syndrome as we’ll see in this study) and instrumental in detecting disease-causing loci.
One cursory way to detect large chromosomal aberrations is to look at the average log ratios across all autosomal chromosomes, similar to how we verified gender.
- Open the Statistics (per Row) – Mean spreadsheet again and choose Plot > Histograms.
- Check all the boxes except ChrX and Phenotype. Select Plot all items in one graph under Select item arrangement, and click Plot.
The resulting plot looks like Figure 6-1 (with a bin count of 64).
Notice the two outliers: one to the left and one to the right of the plot. The point on the right indicates that one sample has a much higher average than the others for Chromosome 21 (as indicated by the legend on the top of the graph or by clicking on the purple bar in this histogram). Similarly, the point on the left indicates that one sample has a much lower average than the others for Chromosome 18.
- From the Statistics (per Row) – Mean spreadsheet right-click on the Chr18 column (18) and select Sort Ascending.
This brings the sample with the lowest log ratio average for chromosome 18 to the top, which is sample 200.
- Next, right-click on the Chr21 column (21) and select Sort Descending.
This brings the sample with the highest log ratio average for chromosome 21 to the top, which is sample 139. Now plot these samples.
- Open Affy 500K CNV LogR – Samples Columnwise and choose GenomeBrowse >Numeric Value Plot.
- Select the boxes for S139 and S200 and click Plot.
You get the plot in Figure 6-2. Notice the large chromosomal aberrations in chromosome 21 for S139 and chromosome 18 for S200. Zoom in for a closer look.
- At the top of the plot window there is a region selection box (see Figure 6-3). Enter chr16-chr22 and press Enter.
Now you can see better resolution of those two chromosomes in the context of the surrounding normal chromosomes (Figure 6-4).
From the zoomed view you can see that S139 has what appears to be an entire extra copy of chromosome 21 which is representative of Down Syndrome. S200 has a copy number loss over a large portion of the q arm of chromosome 18. Neither of these appears to be cell line artifacts, but it’s a judgment call whether to leave them in or not for subsequent association analysis. For the purpose of this tutorial, leave them in.
This concludes the sample quality assurance portion of the tutorial. Next we’ll look at potentially a much larger quality assurance issue: batch effects.
- Close all open windows (Window > Close All) before continuing except the Project Navigator.
7. PCA for Detection and Correction of Batch Effects
This section outlines the process for detecting the presence of batch effects and other technical artifacts using principal component analysis (PCA), as well as correcting the log ratio data based on the PCA results.
A. How is PCA of LR data different from PCA of SNP data?
The raw signal intensity data produced by genotyping arrays is inherently noisy. For any given marker, the subjects in a study may have a wide range of intensity values for each of the two assayed alleles. The process of making genotype calls simplifies this noisy quantitative data, assigning each data point to one of three possible genotype categories. After removing the noise from the data, and assigning each individual to one of just three categories for a very large number of variables, it is possible to detect patterns in the data reflecting inherent differences between large subgroups of the study cohort.
The SNP GWAS tutorial demonstrates the use of principal components analysis to detect population stratification in a study cohort. Performing PCA with SNP data requires converting all of the genotypes to a numeric form. This is usually accomplished by converting each genotype call to 0, 1, or 2, representing the number of copies of the rare allele present at each locus. With PCA of LR data, the approach is somewhat different.
The LR values for SNP markers are based on the combined intensity of the probes for both possible alleles. We therefore assume that there are (almost) always two copies present at each marker, and we aren’t concerned about which alleles are present. Although the raw signal intensities are normalized prior to LR calculation, some substantial variation usually remains. In almost all cases, we have found that PCA of LR data reveals the subtle but systematic differences in genome-wide intensity distributions that result from experimental variability. We have repeatedly observed clustering patterns in the principal components that correlate directly with laboratory variables such as plates, array scanners, array manufacturing lots, processing order and/or date, DNA source, etc.
NOTE: See Dr. Christophe Lambert’s blog about study design for more examples.
B. Detecting the Presence of Batch Effects
Detecting the presence of batch effects and other technical artifacts can be accomplished in a variety of ways. A simple method is to perform tests of association between the genome-wide LR values and various experimental variables (e.g. batch) to see if there are a large number of spurious associations across the genome. If batch effects are present, the Q-Q plot for such a test will indicate a major inflation of significant results. Another method, outlined here, is to use principal component analysis (PCA), which not only detects the presence of batch effects, but in some instances can also be used to correct the data.
IMPORTANT: Using non-autosomal chromosomes to calculate principal components may result in an erroneous systematic shift of the data. We therefore recommend excluding them from PCA. The dataset in this tutorial contains information on the X chromosome, which you need to inactivate before continuing.
- From Affy 500K CNV LogR – Samples Rowwise – Sheet 1, choose Select > Activate by Chromosomes.
- Uncheck the X box and click OK.
- Select Numeric > Numeric Principal Component Analysis.
- Select Compute the principal components.
- Enter 100 for Find up to top ___ components.
NOTE: It is not useful to calculate more than N-1 components, where N is the total number of samples. The maximum in this case would be 199. We will calculate 100 in the interest of time.
- Under Output, uncheck Output corrected input data and check both Output principal components to spreadsheet and Output separate spreadsheet of eigenvalues.
- Check Center data by marker under Data Centering.
- Verify that your options match those of Figure 7-1 and click Run.
This process may take several minutes or more depending on your system configuration. (You may consider calculating a lower number of components to accelerate the process.) Upon completion, two new spreadsheets are created – the Principal Components (Center by Marker) spreadsheet and the PC Eigenvalues (Center by Marker) spreadsheet. We will use the phenotype and principal components spreadsheets to investigate possible batch effects or stratification.
- Open Sim_Pheno Dataset – Sheet 1 spreadsheet and active all data by going to Select > Activate All
- Then choose File > Join or Merge Spreadsheets. Select the Principal Components (Center by Marker) spreadsheet and click OK.
- From the join window, enter Pheno + PCs as the New Dataset Name, leave all other parameters as the defaults and click OK.
Joining the two spreadsheets together enables you to color-code each sample or data point according to a respective phenotypic or experimental variable from the spreadsheet. Ideally, the spreadsheet would contain related batch information that could be used to color-segregate the data. In this tutorial, other phenotypic variables are used to illustrate the same idea.
- From the Pheno + PCs – Sheet 1 spreadsheet select Plot > XY Scatter Plots.
The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column (principal component) to represent the independent or X axis. The list view on the right is for selecting a single or multiple columns (principal components) to represent the dependent or Y axis. Start by first plotting the two largest principal components against each other.
- In the left list box select EV = 2.14661 (the largest, or first principal component). In the right list box check EV = 0.883451 (the second principal component) and click Plot.
The plot in Figure 7-2 is created with a primary cluster of data points on the left and several more points scattered to the right, which may be thought of as forming a separate cluster.
In ideal circumstances, with very consistent data, we expect all data points to form a single, cohesive grouping in this type of plot. We also expect that any observed clustering will not be related to the primary phenotype. If there is any clustering of cases and controls, this is usually indicative of batch effects or other systematic differences in the generation of the data, and it may cause problems in association testing.
- Click on the EV = 0.883145 node in the Graph Control Interface.
- Under the Color tab, select the By Variable radio button and click Select Variable…. Choose Phenotype from the list and click OK.
You now get a plot that looks like Figure 7-3 where the cluster on the left has mostly cases (green) and the number of controls (blue) increases toward the right side of the plot.
This plot suggests that there is a systematic difference between cases and controls. It could be that cases and controls were processed on separate plates, which would induce batch effects. Whether or not this happened could be determined if batch information were available.
This is just one example of a batch effect. For the sake of this tutorial we’ll accept that batch effects exist and move on to correcting for them. The remaining question is how many to correct for.
C. Determining the Correct Number of Principal Components to Correct For
Determining the correct number of principal components for correction is tricky. In an ideal situation you would correct for as many as possible without over-correcting and removing any signal that might be there. If you look at a scree plot of the eigenvalues (from the spreadsheet PC Eigenvalues (Center by Marker), right click on the Eigenvalue column header and select Plot Variable), you can see that the distance between principal components begins to decrease substantially between six and ten components (Figure 7-4).
As a simple rule, you might identify an “elbow” or inflection point in this range as the number of components that maximizes the amount of variability explained by the least number of principal components as the appropriate number of principal components for which to correct the data. However, visual inspection is not always accurate. We suggest a more holistic approach, which we’ve outlined in a separate tutorial due to its length. If you wish to follow this on your own, see the CNV PCA Search Tutorial.
Having run through this process, we see that the number of principal components to correct for in this dataset is seven (7).
D. Correcting for Batch Effects
Now rerun PCA analysis on the log ratios, this time outputting data corrected for seven principal components.
- From the Principal Components (Center by Marker) spreadsheet, inactivate all columns except the first 7.
- To do this scroll to column 8 and left-click the column label header twice turning the column gray.
- Holding the Shift key, scroll to the end of the spreadsheet and left-click the last column header. All the columns in between will become inactive.
Notice a new spreadsheet tab is created, Principal Components (Center by Marker) – Sheet 2, because SVS is preventing changes from happening to the original Principal Components (Center by Marker) spreadsheet, being that it is used by the join with the Sim_Pheno spreadsheet.
- Open Affy 500K CNV LogR – Samples Rowwise – Sheet 2 and select Numeric > Numeric Principal Component Analysis.
- Choose Use precomputed principal components. Click Select Sheet and select Principal Components (Center by Marker) – Sheet 2. Click OK.
- Check the Output corrected input data box. Make sure Center data by marker is still checked and uncheck everything else. Click Run.
The result is a new PCA-Corrected Data (Center by Marker assumed) spreadsheet upon which you can perform CNAM Optimal Segmenting to identify copy number variations.
NOTE: When using principal components as a QC tool for identifying technical artifacts, we recommend using all available subjects, as doing so may help to bring out underlying effects in the data. However, if you are using PCA to correct LR data as a final step prior to data analysis, It is often advisable to remove the low-quality samples prior to calculating the PCs.
8. Additional Tips and Ideas
Congratulations! You have now worked your way through an entire copy number QC process–no easy task. The steps described in this tutorial are appropriate for assessing sample quality for most CNV analysis applications. A few additional ideas are shown below. We wish you the best of luck on your own study. As always, if you have any questions or need help with any portion of this tutorial or your own analysis, please contact us.
A. “Master” QC spreadsheet and Alternate Method for Filtering Samples
In previous sections, this tutorial demonstrated filtering of problematic samples by inactivating rows of the phenotype spreadsheet corresponding to samples that failed various QC metrics.
Another approach that we highly recommend is to merge all relevant phenotype data and QC measurements into a single spreadsheet. This master spreadsheet can then be used for further exploration of sample quality. Using it, you might wish to visualize various measurements in relation to each other or perform statistical association tests to compare the QC metrics to each other and to the phenotype. You can use the spreadsheet editor to create binary indicator variables to identify samples that pass or fail each metric and to create the master index for which samples to use for downstream analysis. An example of what such a spreadsheet might look like is in Figure 8-1.
B. Further exploration of Principal Components
This tutorial has demonstrated how to plot the first and second principal components to visually examine the data for stratification. It is usually good to look at additional principal components as well. You might consider plotting additional scatter plots, or you might plot histograms of the individual components. Typically, any evidence of stratification will appear in the components with the largest eigenvalues. If you identify one or more principal components with evidence of stratification, you can merge the components with the master phenotype and QC spreadsheet in order to help identify the cause of the stratification. If the stratification is the result of a particular feature in the LR data (such as a very common large deletion), you can find this out by merging the principal components with the LR data and running a genome-wide numeric association test using the PC in question as the dependent variable. If a particular genomic region is causing the stratification, that region will be obvious in a Manhattan plot of the p-values. If the stratification is the result of experimental variation/batch effects, no single region is likely to be emphasized in the Manhattan plot.
C. Filter Samples on Autosomal Segment Counts
One useful QC tool that is not shown in this tutorial is counting the number of copy number segments identified in each sample by the univariate segmentation algorithm. This can only be done after segmentation is completed, and is demonstrated in the CNV analysis tutorial. Especially noisy samples tend to have a large number of CNV segments identified. High segment counts generally correlate with high waviness measurements.
D. Reprocessing Samples Prior to Analysis
After identifying problematic samples, you may wish to repeat the data normalization and LR calculations. This is particularly useful with Affymetrix data, where you have greater control over the selection of the reference data used for LR calculations. (Illumina’s GenomeStudio software uses reference intensity values calculated from a panel of HapMap subjects.) If you originally processed the data using all samples or even all controls as a pooled reference, then the presence of problematic samples might have introduced some unexpected artifacts. Repeating the data processing using only the “clean” samples may lead to a better final dataset for analysis.