Welcome to the VarSeq Whole Genome CNV Caller Tutorial!
Updated: January 13, 2021
Version: 2.2.2 or higher
This tutorial covers the basics of the VarSeq Whole Genome CNV calling algorithm with an emphasis on visualization and interpretation of results.
To complete this tutorial you will need to download and unzip the following file, which includes a starter project.
This workflow requires an active VarSeq license with the CNV Caller on Binned Regions feature included. You can go to Discover VarSeq or email email@example.com to request an evaluation license with the CNV functionality included.
Files included in the above ZIP file: VarSeq WGS CNV Caller Tutorial – Starter project containing the variant and coverage data for 21 samples.
VarSeq version 2.2.2 was used to create this tutorial. While every attempt will be made to keep this content relevant, it is possible that certain features or icons may change with newer releases.
The most recent version of VarSeq can be downloaded from here: VarSeq Download.
Select your operating system and download. Additional information for platform specific installation can be found in the Installing and Initializing section of the manual.
The Setup Wizard will then guide you through the setup process.
On the final page of the Setup Wizard, select Finish with the Launch VarSeq option checked.
This will bring up the introductory VarSeq page where new users can register their information. A confirmation email will then be sent to verify the email address.
Once the email has been confirmed, users can select the Login tab and enter their login email and password.
At this point, the VarSeq Viewer mode is accessed and can be used. If the user already has a license key, this can be activated by selecting Help on the title bar and then selecting Activate a VarSeq License Key.
This will bring up a dialog where the license key can be entered. Enter your license key and select Verify.
Once the license key is verified, read the license agreement, select I accept the license agreement, and select Verify.
Congratulations! At this point, the product license is activated and you are ready to start an example project or a tutorial!
Note: During the initial installation process, the user will be asked where to store the AppData folder. Although this location can be changed after installation, it is recommended that multiple-user organizations select a shared drive location to increase ease of project sharing and to decrease redundancy.
WGS CNV Calling Algorithm Overview
VarSeq ® software supports calling CNVs from coverage data computed from imported BAM files. This tutorial focuses on calling and interpretation of CNVs using VarSeq from whole genome sequencing (WGS) data.
In this tutorial, we will begin by opening an existing project containing computed binned coverage data for a number of samples. Using this coverage data, we will call CNVs, plot the CNV data, and interpret the results.
The project files are contained in the ZIP folder that accompanies this tutorial. This project contains variant and coverage data for 21 samples. After the ZIP folder has been downloaded, extract the contents to a convenient location.
The VarSeq WGS CNV calling algorithm relies on coverage information computed from BAM files. The algorithm uses changes in coverage relative to a collection of reference samples as evidence of CNV events. Using these reference samples, the algorithm computes two evidence metrics: Z Score and Ratio. The Z Score measures the number of standard deviations from the reference sample mean, while the Ratio is the normalized mean for the sample of interest divided by the average normalized mean for the reference samples. The utility of these metrics can be seen by looking at the event shown below.
The following is strongly recommended for the composition of the reference samples:
- Having 30 or more reference samples
- Deriving them from the same library prep method, though they need not come from the same run
In Figure 1-1, the drop in both Z Score and Ratio over multiple exons of the ATM gene provides supporting evidence for the called het deletion event.
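The two evidence metrics described above can be illustrated with a minimal Python sketch for a single region. This is only an illustration of the definitions given in the text, not VarSeq's implementation; the function name, the input values, and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def evidence_metrics(sample_cov, reference_covs):
    """Return (z_score, ratio) for one region.

    sample_cov     -- normalized coverage of the sample of interest (hypothetical input)
    reference_covs -- normalized coverage of each reference sample in the same region
    """
    ref_mean = mean(reference_covs)
    ref_sd = stdev(reference_covs)  # assumption: sample standard deviation
    z_score = (sample_cov - ref_mean) / ref_sd  # standard deviations from the reference mean
    ratio = sample_cov / ref_mean  # sample mean divided by average reference mean
    return z_score, ratio


# Made-up values: the references sit around 1.0, the sample at half that.
refs = [0.9, 1.0, 1.1, 1.0, 1.0]
z, r = evidence_metrics(0.5, refs)
print(f"Z Score = {z:.2f}, Ratio = {r:.2f}")
```

A Ratio near 0.5 together with a strongly negative Z Score is the pattern expected for a heterozygous deletion, while values above 1 with a positive Z Score would point toward a duplication.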
The WGS Binned CNV caller generally looks for large CNV events (on the scale of multiple genes or even an entire chromosome). The Z Score and Ratio metrics can exhibit some noise over these larger event regions and register a larger event as multiple smaller events. The solution to this problem is segmentation, which groups the smaller events into one larger event.
You can easily see this in Figure 1-2, where a duplication of the entire chromosome 2, run without segmentation, appears as many small duplication events, whereas with segmentation the large aneuploidy event is accurately represented.
Using these metrics, the algorithm assigns a CNV state to each binned region and then merges these regions to obtain contiguous CNV events.
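The final merging step can be sketched as follows. This is a deliberately simple run-merge over per-bin states, assuming the bins are sorted and contiguous; it is not the segmentation algorithm VarSeq uses, and the tuple layout and the "Diploid" label are hypothetical.

```python
from itertools import groupby


def merge_bins(bins):
    """Merge consecutive bins that share a chromosome and a non-diploid state.

    bins -- list of (chrom, start, end, state) tuples, sorted by chrom and start.
    Returns a list of (chrom, start, end, state, n_bins) events.
    """
    events = []
    for (chrom, state), run in groupby(bins, key=lambda b: (b[0], b[3])):
        run = list(run)
        if state == "Diploid":  # hypothetical label for a normal copy number bin
            continue
        events.append((chrom, run[0][1], run[-1][2], state, len(run)))
    return events


bins = [
    ("2", 0, 10000, "Diploid"),
    ("2", 10000, 20000, "Duplicate"),
    ("2", 20000, 30000, "Duplicate"),
    ("2", 30000, 40000, "Diploid"),
    ("3", 0, 10000, "Duplicate"),
]
events = merge_bins(bins)
for event in events:
    print(event)
```

In this toy input the two adjacent duplicated bins on chromosome 2 become one event, while the bin on chromosome 3 remains a separate event.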
Once a set of CNV events has been called, quality control flagging is performed to identify unreliable samples and potentially problematic CNV calls. These QC flags are applied to both CNV events and samples.
The following are examples of CNV event flags:
- Low reference sample read depth in the surrounding region;
- High variation in the region between reference samples; and
- Ratio or Z Score falling within the noise of the surrounding region.
Samples may be flagged when:
- Their metrics have extremely high variation;
- They have very low mean depth; and
- They differ significantly from the selected reference samples.
By flagging these events and samples, we provide a second layer of heuristics, which can be used to reduce false positives and identify questionable CNV calls.
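As a rough illustration of the first two event flags, the sketch below checks the reference sample depth and the variation between reference samples for one region. The thresholds, flag names, and use of the population standard deviation are all assumptions made for this example; the regional noise check is omitted because the text does not define it.

```python
from statistics import mean, pstdev


def event_flags(ref_depths, min_ref_depth=10.0, max_cv=0.3):
    """Return QC flags for one region based on reference sample depths.

    min_ref_depth and max_cv are hypothetical thresholds, not VarSeq defaults.
    """
    flags = []
    avg = mean(ref_depths)
    if avg < min_ref_depth:
        flags.append("Low reference depth")  # hypothetical flag name
    # Coefficient of variation: spread of reference depths relative to their mean.
    cv = pstdev(ref_depths) / avg if avg > 0 else float("inf")
    if cv > max_cv:
        flags.append("High reference variation")  # hypothetical flag name
    return flags


print(event_flags([5, 6, 4, 5]))      # shallow but consistent references
print(event_flags([30, 60, 10, 20]))  # deep but inconsistent references
print(event_flags([30, 31, 29, 30]))  # no flags
```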
Importing Variant and Alignment Data
Important: The starter project provided in this tutorial already contains the variants and coverage data for 21 samples. In this portion of the tutorial we will show you how the import of the VCF variant data was completed and how the coverage data was computed on the BAM files so you can also follow along using your own data instead of using the provided project.
If you are already familiar with this process or will be working with the project provided for this tutorial, please skip to the Running the CNV Caller section of the tutorial.
As mentioned earlier, the VS-CNV algorithm uses changes in coverage relative to a collection of reference samples as evidence of CNV events. To create a set of reference samples to be used as a basis for CNV calling, users can compute coverage on BAM files using the Reference Sample Manager.
- Open VarSeq and click Tools > Manage Reference Samples. This menu computes coverage on BAM files and subsequently adds CNV Reference samples to the reference sample library.
Click on the Add References button and select Add Files on the first screen of the Add CNV References to add sample BAMs.
Ensure that Binned References is selected. Next, if there are regions to exclude, click on Select Track to browse to the interval track (BED file) that defines the regions over which coverage will not be calculated. Note that users can import their own BED files using the Convert Wizard. Once an interval track has been selected, click Create to create a set of reference samples to be used as a basis for CNV calling.
Now that you have added samples to the reference sample set, you can create a VarSeq Project and import samples to call CNVs on.
- Open VarSeq and click Create New Project. Select the Empty Project option. Select your genome assembly and a name for the project and click OK.
- Click on the Import Variants button and select Add Files on the first screen of the Import Variants Wizard.
- Navigate to the directory where your VCF files are saved and select them for import (as shown in Figure 2-5), then click Next >.
If you do not use the Manage Reference Samples option to import your reference samples as mentioned above, you will need to import enough samples to build your Reference Panel. 30 samples is the recommended number of reference samples. Therefore, you will want to import at least 31 samples, 30 used for reference and an additional sample for analysis.
Once the 31 samples are processed through the CNV tool, VarSeq will save the coverage profile for these samples in the Coverage Reference Samples folder found in the VarSeq User Data location on your computer (Tools > Open Folder > Reference Samples Folder).
For any subsequent run of the algorithm you can import any number of samples for analysis and VarSeq will pull a reference set of samples from those available in the Reference Sample Folder.
If importing into an Empty Project you can select the Individual Samples option in the Sample Relationships dialog. Click Next >.
On the next dialog we will be associating the BAM files with the imported VCF files so that Targeted Region Coverage can be computed.
- Click Associate BAM File at the top of the dialog and navigate to the directory where your BAM files are stored. If your BAM file names match the sample name or file name for the VCF file, they should be associated automatically. If not, manually select each BAM file. Click OK once done.
The BAM file paths should now be filled out for each sample on the import dialog.
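If you want to check ahead of time which samples will need a manual BAM selection, a small script can compare sample names against BAM file names. This is a sketch of the name-matching idea only, assuming an exact match between the sample name and the file name without its .bam extension; VarSeq's own matching rules may be more permissive, and the sample names and file names here are made up.

```python
import tempfile
from pathlib import Path


def associate_bams(sample_names, bam_dir):
    """Map each sample name to a BAM path with the same base name, or None."""
    bams = {p.stem: p for p in Path(bam_dir).glob("*.bam")}
    return {name: bams.get(name) for name in sample_names}


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as bam_dir:
    for file_name in ("Female 1.bam", "Female 2.bam"):
        (Path(bam_dir) / file_name).touch()
    matches = associate_bams(["Female 1", "Female 2", "Male 1"], bam_dir)

unmatched = [name for name, path in matches.items() if path is None]
print("Needs manual association:", unmatched)
```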
Click Next > and Finish to complete the VCF variant data import.
Next, compute the binned coverage required to detect CNVs.
- Go to Add > Computed Data… to bring up the different algorithm options for the project.
Scroll near the bottom to the Sample section, select Binned Region Coverage, and then click OK.
The Binned Region Parameters dialog then appears with different options like Additional Depth Threshold and the option to mask specific regions, but for this tutorial, leave the default options and select OK.
Note: Samples will only be matched to reference samples with the same bin size.
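The bin size constraint can be pictured with a short sketch: a chromosome is cut into fixed-width bins, and only references computed with the same bin width are comparable region by region. The bin size, chromosome length, and reference names below are made up for illustration and are not VarSeq defaults.

```python
def make_bins(chrom_length, bin_size):
    """Split a chromosome of the given length into fixed-size (start, end) bins."""
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]


def compatible_references(sample_bin_size, reference_bin_sizes):
    """Keep only references whose coverage was computed with the same bin size."""
    return [
        name
        for name, size in reference_bin_sizes.items()
        if size == sample_bin_size
    ]


toy_bins = make_bins(25000, 10000)
usable = compatible_references(
    10000, {"ref_a": 10000, "ref_b": 5000, "ref_c": 10000}
)
print(toy_bins)
print(usable)
```

Because the regions themselves differ when the bin size differs, coverage from ref_b in this example cannot be compared bin by bin with the sample.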
Once this computation finishes you are ready to begin CNV calling.
Running the CNV Caller
When you open the example project accompanying this tutorial, you will be greeted by the VarSeq Coverage Regions table. This table includes information about the read depth of each coverage region for the sample of interest.
To call CNVs over these coverage regions:
- Click the Add button in the upper left-hand corner of the window
- Select Computed Data
- Change the dropdown menu to Coverage Regions
- Select CNV Caller on Binned Regions.
This will open up the Binned CNV Caller dialog window.
The options presented here include the following:
- Minimum Number of Reference Samples: The minimum number of reference samples to be selected by the algorithm.
- Maximum Number of Reference Samples: The maximum number of reference samples to be selected by the algorithm.
- Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
- Add samples to reference set: This option adds the current project’s sample to the set of reference samples.
- Independently normalize non-autosomal targets: If this option is selected, non-autosomal targets will not be normalized using the autosomal targets, but will instead be normalized separately. This option should be used if few non-autosomal targets are present, or if the entire X or Y chromosomes are likely to be deleted or duplicated.
- Reference Sample Folder: Specifies the file location where the reference samples are stored.
- Z-Score Threshold: Specifies the Z Score cutoff threshold for calling CNV events.
- Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
- Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
- Use optimal segmentation algorithm (slower): Runs the optimal segmentation algorithm, which takes more time to complete.
Leave the default options and run the CNV calling algorithm:
- Click OK
When the algorithm runs, it will select a set of reference samples for each sample in the project. The reference set is chosen from the collection of samples in the reference folder that share the same binned regions as the sample of interest. The algorithm selects those samples that are most similar to the sample of interest in terms of normalized coverage.
Because we chose to Add samples to reference set, the 21 samples in our coverage table will first be placed in our reference set and then used by the algorithm. If one of the project samples was already added to the reference sample set, it will not be duplicated in the CNV analysis.
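The interaction between the minimum, maximum, and percent difference options can be sketched as below. This is one plausible reading of the option descriptions, not VarSeq's actual selection logic: candidates are ranked by percent difference, the closest ones are always accepted until the minimum is reached, and further candidates are accepted only while they stay under the percent difference cutoff and the maximum has not been hit. All default values are hypothetical.

```python
def select_references(percent_diff, min_refs=10, max_refs=30, max_percent_diff=20.0):
    """Pick reference samples most similar to the sample of interest.

    percent_diff -- dict mapping reference name to its percent difference
                    from the sample of interest (lower means more similar).
    """
    ranked = sorted(percent_diff.items(), key=lambda item: item[1])
    chosen = []
    for name, diff in ranked:
        if len(chosen) >= max_refs:
            break
        # Once the minimum is met, stop at the first candidate above the cutoff.
        if len(chosen) >= min_refs and diff > max_percent_diff:
            break
        chosen.append(name)
    return chosen


# Forty made-up candidates with percent differences 1.0, 2.0, ..., 40.0.
candidates = {f"ref{i}": float(i) for i in range(1, 41)}
selected = select_references(candidates)
print(len(selected), "references selected")
```

With these toy numbers the cutoff stops the selection before the maximum is reached; lowering the cutoff far enough would still leave the minimum number of references in place.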
Performing Sample QC
Once the CNV caller finishes computing results, a new table will be created labeled CNVs. This table contains the information related to each CNV called by the algorithm, but before examining these results, users should always perform sample-level quality control. This can be done by exploring the sample table, which is now populated with several useful metrics related to the CNV algorithm.
To open the sample table in VarSeq, click on the plus sign on the tab bar and then select Table.
This will open up a new blank table tab. From the Select Table Type… dropdown menu, select Samples to display the Samples table.
You will notice that the Samples table has the column groups (from left to right) of “Sample Info”, “Binned Coverage Statistics”, and “Copy Number Variants”. The first group, “Sample Info”, displays the Sample Name, Affection status, Sex, and BAM Path. The next group shows the coverage statistic information associated with the Binned Coverage algorithm used to compute the Coverage Statistics. The third group displays the information for each CNV called. Scroll over to the “Copy Number Variants” heading.
The most useful field for sample QC is the “Sample Flags” field. This field will list one or more of the following flags if the sample fails any of our quality tests:
- High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
- Low Sample Mean Depth: Sample mean depth below 30.
- Mismatch to reference samples: Match score indicates low similarity to control samples.
- Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
- Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
If any of the first three flags is listed for a given sample, then all CNV calls associated with the sample will most likely be unreliable. If either of the last two flags is present, then CNV calls in non-autosomal regions will be unreliable.
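The rule above can be written down as a small lookup, which may be handy when triaging many samples from an exported sample table. The flag strings are taken from the list above; the function name and return labels are hypothetical.

```python
# Flags that make every CNV call for the sample suspect.
ALL_CALLS_FLAGS = {
    "High IQR",
    "Low Sample Mean Depth",
    "Mismatch to reference samples",
}
# Flags that only affect calls on non-autosomal chromosomes.
NON_AUTOSOMAL_FLAGS = {
    "Mismatch to non-autosomal reference samples",
    "Few Gender Matches",
}


def unreliable_scope(sample_flags):
    """Return which CNV calls to treat as unreliable: 'all', 'non-autosomal', or 'none'."""
    flags = set(sample_flags)
    if flags & ALL_CALLS_FLAGS:
        return "all"
    if flags & NON_AUTOSOMAL_FLAGS:
        return "non-autosomal"
    return "none"


print(unreliable_scope(["High IQR"]))
print(unreliable_scope(["Few Gender Matches"]))
print(unreliable_scope([]))
```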
Notice the five highlighted samples in Figure 4-3 with the High IQR flag. The low matching quality of these samples may warrant rerunning the samples to improve their quality to better match the additional reference samples.
In addition to QC flags, the sample table also provides summary information about the number of CNVs called, the inferred gender of the sample, the reference samples chosen, and the percent difference between each sample and its reference set.
Plotting and Filtering CNV Data
Now that we have performed sample-level QC, we can filter and plot our CNV calls, along with the relevant evidence.
To switch from the sample table to the CNV table:
- Select the CNVs tab from the tab bar.
The CNV table provides many useful pieces of information that are ideal for filtering CNV calls. Plotting the CNV results can also be helpful when performing analysis.
Before plotting, the CNV State column can be queried to exclude missing values. This can be done by right-clicking on the CNV State column header and selecting Query Column Values.
This opens up a new filter tag along the top of the CNV tab. Click on the question mark to display the different options in this column.
Checking all of the options in this list will keep them in the column and then remove any missing values. The selected options here are Duplicate, Het Deletion, Deletion, and CN LoH in this example. The selected options will now appear in the query filter tab in the header, and clicking anywhere on the screen outside of the query value selection window will set the currently selected configuration and close the window.
Although there are no CNVs called in this sample (Female 1), we will continue to set up the CNV analysis and then look at a different sample with a CNV called.
Now that the missing CNV State values are not being shown, the CNV State column can be implemented into the filter chain to isolate the specific events per sample. This is done by right-clicking on the CNV State column header and selecting Add to Filter Chain.
This allows for the selection of CN LoH events, Deletions, Duplications, and Het Deletions for the given sample.
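The effect of the query and the filter can be mimicked outside VarSeq on exported rows. The sketch below assumes each row is a dictionary with a hypothetical "cnv_state" key, where a missing state is stored as None; the state labels are the ones listed in the query above.

```python
def filter_cnvs(rows, keep_states):
    """Drop rows with a missing CNV state and keep only the selected states."""
    return [row for row in rows if row.get("cnv_state") in keep_states]


# Made-up rows for illustration.
rows = [
    {"region": "3:1000-2000", "cnv_state": "Duplicate"},
    {"region": "3:2000-3000", "cnv_state": None},
    {"region": "5:1000-2000", "cnv_state": "Het Deletion"},
    {"region": "7:1000-2000", "cnv_state": "CN LoH"},
]

all_called = filter_cnvs(rows, {"Duplicate", "Het Deletion", "Deletion", "CN LoH"})
only_dups = filter_cnvs(rows, {"Duplicate"})
print(len(all_called), "rows with a called state;", len(only_dups), "duplication")
```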
This field can also be plotted by right-clicking on the CNV State column header and selecting Plot for Current Sample.
This will open a GenomeBrowse view containing the CNV State of the current sample plotted alongside the gene track. You may have to click on the CNV row in the CNV table to navigate directly to the event.
In addition to the CNV state, it is also useful to plot the evidence used to call the CNVs. To do this:
- Open the coverage table by selecting the Coverage Regions tab.
This table contains the CNV data associated with each binned coverage region. This includes the regional CNV State, Flags, Z Score, Ratio, and the number of variants for which VAF was considered.
The two primary pieces of evidence used to call CNVs are the Z Score and Ratio.
To plot these fields:
- Right-click on the Z Score column, then select Plot for Current Sample.
- Then, right-click on the Ratio column, then select Plot for Current Sample.
Next we will take a look at one of the CNV entries in a different sample. To switch samples:
- Select the dropdown menu on the title bar and select sample, Female 2.
Next, take a look at the CNV calls associated with this sample by:
- Select the CNVs table tab.
There are two CNVs called for this sample. Both are Duplications found on chromosome 3, and they are highlighted in Figure 5-11.
When the table view window and GenomeBrowse window are both displayed, the GenomeBrowse view will move to the detailed genomic region about the selected CNV. Here, the user can more easily notice the elevated Z Score and Ratio values for the CNV event compared to the surrounding diploid regions.
The CNVs that are found with the Binned Region coverage algorithm are usually large events on the scale of entire genes or even entire chromosomes. For the previous example, we can determine which genes are overlapped by the CNV by taking the following steps:
- Click on the Add title bar icon and then Computed Data…
- Change the dropdown menu on the top from Variants to CNVs
- Select ACMG Sample CNV Classifier from the Per Sample section.
This brings up the Algorithm Options window which allows users to Expand events to exon boundaries and select or create an Internal Database of Triplosensitive and Haploinsufficient genes. To learn more about assessment catalogs, refer to this resource. Once your assessment catalog is selected, click OK.
This will then display the interface of further Algorithm Options asking to specify the Minimum similarity coefficient. This threshold is the minimum similarity coefficient of matching annotation regions. Leave as default and select OK.
This will bring up the dialog of Annotate Overlapping Genes where preferred transcripts can be entered. Leave the options as default and click OK.
The ACMG Sample CNV Classifier is now loaded as an annotation in the CNV table, along with Copy Number Probability/Segregation. For more information on these fields, please refer to our release notes. Next, change the sample to Female 4 and scroll to the right to the ACMG Classifier, which has three CNV events classified as Likely Pathogenic, Pathogenic, and VUS.
The next steps will include importing and evaluating the CNV events in VSClinical. To do this, close out of GenomeBrowse and click on the “+” icon to open a new tab, then select VSClinical.
This will populate the interface to select between the AMP Guidelines or ACMG Guidelines. Select ACMG Guidelines.
The next dialog will be the VSClinical ACMG Options where users will need to specify or create assessment catalogs in the Options window. Users will need to create Internal Database of Classified Germline ACMG Variants for Samples:, Internal Database of Classified Germline ACMG CNVs for Samples:, and Internal Database of Gene Dosage Sensitivity Curations. Here is a resource with detailed information into the different types of assessment catalogs. Once these internal databases have been created and selected, click OK.
This will populate with further Project Options. In this dialog, the variant sets for reporting Primary and Secondary Findings, and Uncertain Significance need to be created. To do this select Create in the prompt on the lower part of the screen, which will autopopulate these fields. Next click Apply then Close.
The last step before entering into the VSClinical interpretation hub is to select Start New Evaluation.
VSClinical CNV Evaluation
To begin evaluating CNVs with VSClinical, you must first add the CNV events into the interface. To do this, scroll down to the CNVs to Evaluate section in the Evaluation tab. From here, select Add CNVs From Project.
This will automatically populate the CNVs that were filtered in the project. Next, select Prepare to Add, and then Add 3 CNVs.
Once the CNVs have been added, navigate to the CNVs tab. For this example, we will evaluate the CNV event that covers the critical gene PBX1. To evaluate this event, click on the red icon that contains PBX1 in the upper right corner of the interface.
The VSClinical interface allows users to evaluate CNVs according to the ACMG CNV guidelines. The first criterion that we will evaluate is whether the event overlaps protein coding regions. To see this, scroll down to the Genomic Region section. As shown, this event overlaps 1443 protein coding genes. Since this is greater than 35 genes, criterion 3C is applied with a point value of 0.90. Additionally, we can see Reasons for Not 4O:, which indicates that the CNV is not contained by any high frequency regions in DGV or gnomAD.
Next, scroll down to the Overlapping Genes section. In the case of an event that spans multiple genes, the algorithm will score and rank the genes based on their evidence of dosage pathogenicity. In this example there are multiple genes that have Evidence of Haploinsufficiency but the one we will evaluate will be PBX1. If you want to change the gene that will be evaluated, you can select it from the list.
Next, scroll down to the section: Reporting for PBX1 (Scored Gene). On the right hand side, there is a CNV Loss Scoring Criteria for PBX1, which gives users the ability to follow the decision tree for each criterion. To view this information, click on the Edit or Start icon for the individual criterion. Since this event spans multiple genes that have evidence for haploinsufficiency, the automatic score is +1.90, which indicates that this is a Pathogenic CNV. If you want to incorporate information regarding the impact on the gene or about the gene in general, you can click on the blue “+” icons below the Auto Interpretation of Gene Impact and the Gene Summary dialogs. This will add the information to the Notes on Relevance to Patient and Notes on Gene. Once completed, navigate back to the top to the CNV Interpretation for Sample Female 4 section.
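To see how a total score maps to a classification, the sketch below applies the point thresholds from the published ACMG/ClinGen CNV scoring framework as we understand them (0.99 or more for Pathogenic, 0.90 to 0.98 for Likely Pathogenic, and the mirrored negative ranges for benign calls). Treat these cutoffs as an assumption for illustration; VSClinical performs this scoring for you and is the authority on the result. The text itemizes only the 0.90 from criterion 3C, so the remaining 1.00 of the +1.90 automatic score is entered here as a single unlabeled value.

```python
def classify(total_points):
    """Map a total CNV score to a classification (assumed ACMG/ClinGen thresholds)."""
    score = round(total_points, 2)  # guard against floating point noise
    if score >= 0.99:
        return "Pathogenic"
    if score >= 0.90:
        return "Likely Pathogenic"
    if score > -0.90:
        return "Uncertain Significance"
    if score > -0.99:
        return "Likely Benign"
    return "Benign"


criterion_3c = 0.90   # from the text: more than 35 protein coding genes overlapped
other_points = 1.00   # assumption: remainder of the +1.90 automatic score, not itemized in the text
total = criterion_3c + other_points
print(f"{total:+.2f} -> {classify(total)}")
```

Note that 3C on its own would land in the Likely Pathogenic range under these thresholds; it is the additional gene-level evidence that moves this event to Pathogenic.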
The final step in the evaluation process is creating the interpretation for the CNV event. To create the interpretation, you can click on the blue “+” icons on the right hand side under CNV Summary, Notes on Relevance to Patient, and Gene Role in Disease and Evidence of Haploinsufficiency. This information will then populate in the Interpretation section on the left side. Next, select Review & Save Now under Interpretation, which will save the interpretation to your assessment catalog. The last step is to select this CNV as a Primary Finding next to the Reporting As option. Once completed, navigate to the Report tab.
The last step in the evaluation is to create the report. In the Reports tab, users can enter any relevant patient or sample information by clicking on the individual dialogs and entering the information on the right side. Additionally, all of the information regarding the created CNV evaluation is automatically incorporated, which can be viewed by scrolling down. Once the patient and sample information has been entered, click on the Word document icon under Report Exports on the right hand side.
In the next interface, select New Report Template.
This allows the user to select the type of template to use under the Copy New Report Template dropdown menu. For this project, we will use the default Gene Panel Template.docx. Under New Template Name:, type in CNV Template and then click Create.
This will bring you back to the original report interface. From here select Render.
This will generate a Word-based report that includes all information from the evaluation.
This tutorial was designed to provide a demonstration of VarSeq’s WGS CNV calling capabilities.
If you are interested in getting a demo license to try out this and other features, please request a demo from: Discover VarSeq.
Additional features and capabilities are being added all the time, so if you do not see a feature you need for your workflows please do not hesitate to let us know at Golden Helix Support!