Welcome to the Tutorial!
This tutorial covers the basics of the VarSeq CNV calling algorithm in the context of whole exome sequencing with an emphasis on annotation and filtering.
To complete this tutorial you will need to download and unzip the following file, which includes a starter project.
This workflow requires an active VarSeq license with the CNV Caller on Target Regions feature included. You can go to Discover VarSeq or email email@example.com to request an evaluation license with the CNV functionality included.
The starter project provided in this tutorial already contains the variants, coverage data, and CNV calls for a typical exome sequencing sample. The next few sections of the tutorial will demonstrate how to import VCF and BAM files, how to compute coverage statistics, and how to perform CNV calling so you can also follow along using your own data instead of the provided project.
If you are already familiar with this process you can skip to the Filtering Low Quality CNVs section of the tutorial.
CNV Calling Algorithm Overview
VarSeq® software supports calling CNVs from coverage data computed from imported BAM files. This tutorial focuses on calling and interpretation of CNVs using VarSeq.
In this tutorial, we will begin by opening an existing project containing precomputed coverage data and CNV calls for a single whole exome sequencing sample. The project files are contained in the ZIP folder that accompanies this tutorial. After the ZIP folder has been downloaded, extract the contents to a convenient location.
The VarSeq CNV calling algorithm relies on coverage information computed from BAM files. The algorithm uses changes in coverage relative to a collection of reference samples as evidence of CNV events. Using these reference samples, the algorithm computes two evidence metrics: Z-score and Ratio. The Z-score measures the number of standard deviations from the reference sample mean, while the Ratio is the normalized mean for the sample of interest divided by the average normalized mean for the reference samples. The utility of these metrics can be seen by looking at the duplication event shown below.
In the figure above, the spike in both Z-score and Ratio over four exons of this gene provides supporting evidence for the called Duplication event.
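As a rough illustration of the two evidence metrics, the following minimal sketch computes a Z-score and Ratio for a single target from normalized coverage values. The function name, the input values, and the use of the sample standard deviation are assumptions made for illustration; they are not taken from the VarSeq implementation.

```python
from statistics import mean, stdev

def evidence_metrics(sample_cov, reference_covs):
    """Return (z_score, ratio) for one target region.

    sample_cov: normalized coverage of the sample of interest at the target.
    reference_covs: normalized coverage of each reference sample at the same target.
    """
    ref_mean = mean(reference_covs)
    ref_sd = stdev(reference_covs)  # sample standard deviation (an assumption)
    if ref_sd == 0:
        raise ValueError("reference samples show no variation at this target")
    # Z-score: number of standard deviations from the reference sample mean.
    z_score = (sample_cov - ref_mean) / ref_sd
    # Ratio: sample normalized mean divided by the average reference normalized mean.
    ratio = sample_cov / ref_mean
    return z_score, ratio

# Hypothetical normalized coverage values for a single exon.
references = [0.98, 1.02, 1.00, 0.97, 1.03]
z, ratio = evidence_metrics(1.50, references)
print(f"z-score = {z:.1f}, ratio = {ratio:.2f}")
```

A ratio near 1.5 together with a large positive Z-score is the pattern one would expect for a heterozygous duplication, while a ratio near 0.5 with a large negative Z-score would point toward a heterozygous deletion.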
A third metric used by the CNV caller is Variant Allele Frequency (VAF). While VAF is not a primary metric for identifying CNVs, it can provide supporting evidence for or against certain types of events. For example, values other than 0 or 1 are evidence against heterozygous deletion events, while values of 1/3 and 2/3 provide supporting evidence for duplications. The advantage provided by VAF can be seen in the figure below.
In the above figure, two exons were called as deletions prior to utilizing VAF. However, the presence of two variants with VAF of 0.5 within the region provides the algorithm with evidence against a deletion, allowing us to successfully classify the exons as diploid.
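The VAF reasoning above can be sketched in a few lines. This is only an illustration of the idea: the function name and the tolerance of 0.08 around each expected value are hypothetical, and the actual caller may weigh VAF evidence quite differently.

```python
def vaf_evidence(vafs, tol=0.08):
    """Summarize what the variant allele frequencies in a region suggest.

    vafs: VAF values (0 to 1) of the variants inside the region.
    tol: hypothetical tolerance around each expected VAF value.
    """
    def near(value, target):
        return abs(value - target) <= tol

    # A heterozygous deletion leaves one copy, so every VAF should be near 0 or 1.
    against_het_deletion = any(not (near(v, 0.0) or near(v, 1.0)) for v in vafs)
    # Three copies shift heterozygous variants toward 1/3 or 2/3.
    supports_duplication = any(near(v, 1 / 3) or near(v, 2 / 3) for v in vafs)
    return {
        "against_het_deletion": against_het_deletion,
        "supports_duplication": supports_duplication,
    }

# Two variants with VAF 0.5, as in the figure described above.
print(vaf_evidence([0.5, 0.5]))
# {'against_het_deletion': True, 'supports_duplication': False}
```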
Using these three metrics, the algorithm assigns a CNV state to each target region and then merges these regions to obtain contiguous CNV events.
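The merging step can be pictured as collapsing runs of adjacent targets that share the same non-diploid state. The sketch below shows only that adjacency merge on hypothetical target tuples; how the caller actually assigns states and decides which targets to merge is not described here, so treat this as a simplification.

```python
from itertools import groupby

def merge_targets(targets):
    """Merge consecutive targets with the same non-diploid state into events.

    targets: list of (chrom, start, end, state) tuples sorted by position.
    Returns a list of (chrom, start, end, state, target_count) events.
    """
    events = []
    # groupby only groups consecutive items, which is exactly the adjacency we want.
    for (chrom, state), group in groupby(targets, key=lambda t: (t[0], t[3])):
        group = list(group)
        if state == "Diploid":
            continue
        events.append((chrom, group[0][1], group[-1][2], state, len(group)))
    return events

# Hypothetical per-target states for five exons.
targets = [
    ("chrX", 100, 200, "Diploid"),
    ("chrX", 300, 400, "Duplicate"),
    ("chrX", 500, 600, "Duplicate"),
    ("chrX", 700, 800, "Duplicate"),
    ("chrX", 900, 1000, "Diploid"),
]
print(merge_targets(targets))
# [('chrX', 300, 800, 'Duplicate', 3)]
```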
Once a set of CNV events has been called, quality control flagging is performed to identify unreliable samples and potentially problematic CNV calls. These QC flags are applied to both CNV events and samples. By flagging these events and samples, we provide a second layer of heuristics, which can be used to reduce false positives and identify questionable CNV calls.
Adding Reference Samples
The VS-CNV algorithm uses changes in coverage relative to a collection of reference samples as evidence of CNV events. To create a set of reference samples to be used as a basis for CNV calling, users can compute coverage on BAM files using the Reference Sample Manager. To open the Reference Sample Manager, click Tools > Manage Reference Samples. This menu is used to compute coverage on reference sample BAM files and subsequently add these samples to the reference sample library.
To add reference samples, click the Add References button and select Add Files on the first screen of the CNV References window to add sample BAM files.
Ensure that Target References is selected. Next, click on Select Track to browse to the interval track or BED file that defines the regions over which coverage will be calculated. Once an interval track has been selected, click Create to create a set of reference samples to be used as a basis for CNV calling.
Once you have added samples to the reference sample set, you are ready to construct the filtered targets track.
When performing CNV calling on exome sequencing data, it is important to filter out low quality targets before running the CNV caller. Low quality targets have inconsistent coverage across samples, which can negatively impact the quality of CNV calls. If such low quality targets are not excluded from the CNV calling process, they could skew coverage normalization and cause an increased number of false positives. There are several metrics that can be used to identify low quality targets, including GC-content, span, and average coverage. VarSeq provides a useful tool for identifying low quality targets so that they can be excluded from CNV analysis.
To access this tool, select Tools > Manage Reference Samples and then click the Create Low Quality Targets button at the lower right corner of the window. This will launch the Filtered Targets Wizard. From here, you can select the desired panel from the dropdown menu and click Next to specify filter thresholds.
By default targets are filtered based on GC-content, span, and coverage, but custom annotation tracks can be added to exclude other problematic regions.
After clicking Next, you will be presented with the confirmation screen, allowing you to specify the Track Name and file path for the newly created annotation track. Once a set of filtered targets has been created, you are ready to create a VarSeq Project and import samples to call CNVs on.
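To make the three default filter criteria concrete, here is a minimal sketch that flags targets by GC-content, span, and average coverage. The dictionary keys and every threshold value are hypothetical placeholders chosen for illustration; the Filtered Targets Wizard lets you set the real thresholds.

```python
def low_quality_targets(targets, gc_range=(0.2, 0.8), min_span=20, min_mean_depth=10.0):
    """Return {target name: [reasons]} for targets failing any threshold.

    All thresholds are illustrative defaults, not VarSeq's values.
    """
    flagged = {}
    for t in targets:
        reasons = []
        if not gc_range[0] <= t["gc"] <= gc_range[1]:
            reasons.append("gc")        # extreme GC-content
        if t["end"] - t["start"] < min_span:
            reasons.append("span")      # target too short
        if t["mean_depth"] < min_mean_depth:
            reasons.append("coverage")  # average coverage too low
        if reasons:
            flagged[t["name"]] = reasons
    return flagged

# Hypothetical targets.
targets = [
    {"name": "exon1", "gc": 0.45, "start": 1000, "end": 1200, "mean_depth": 85.0},
    {"name": "exon2", "gc": 0.85, "start": 2000, "end": 2150, "mean_depth": 60.0},
    {"name": "exon3", "gc": 0.50, "start": 3000, "end": 3010, "mean_depth": 4.0},
]
print(low_quality_targets(targets))
# {'exon2': ['gc'], 'exon3': ['span', 'coverage']}
```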
Importing Variant and Alignment Data
Note that if you are following along with the pre-built tutorial project, you will want to skip to the Filtering Low Quality CNVs section.
To make a new project, click Create New Project, select the desired template, specify your genome assembly, name the project, and click OK.
Next, click on the Import Variants button and select Add Files on the first screen of the Import Variants Wizard. Navigate to the directory where your VCF files are saved and select them for import, then click Next. If importing into an Empty Project you will need to select your Sample Relationships on the next dialog.
In the following dialog, we will associate the BAM file with the imported VCF file so that Targeted Region Coverage can be computed. Click Associate BAM File at the top of the dialog and navigate to the directory where your BAM files are stored. If your BAM files follow a naming convention similar to the sample names listed in the VCF file, they should be associated automatically; if not, select the BAM file manually. Click OK once done.
The BAM file path should now be filled out for the sample on the import dialog.
Click Next and Finish to complete the VCF variant data import.
Now that we have imported our variants and BAM file, we can compute coverage by clicking Add > Secondary Tables > Add Coverage Regions.
In this window, select the interval track that defines the regions over which to compute coverage by clicking Select Track and navigating to the location of your target region file.
After computing coverage, our next step is to run the LoH Caller. This algorithm will identify regions where the variant allele frequency indicates the presence of a loss of heterozygosity or triploid duplication. These regions will be used as supporting evidence when calling CNVs. To launch the LoH caller, click Add > Secondary Tables > Add LoHs.
Next, select the variant allele frequency field and click OK. In the next window, you can specify the minimum number of variants considered, the expected LoH rate, and the minimum required variant genotype quality. Click OK to run the LoH caller.
Now that we have computed coverage and identified LoH regions, we are ready to run the CNV caller.
Running the CNV Caller
To call CNVs over the coverage regions, click Add > Secondary Tables > Add CNVs.
This will launch the CNV calling configuration window, which allows you to set the various parameters associated with the algorithm.
The options presented here include the following:
- Sensitivity/Precision: Determines the sensitivity and precision of the CNV calling algorithm. A value of “Very High Precision” will result in fewer false positives but will increase the number of false negatives, while a value of “Very High Sensitivity” will produce more false positives but will result in fewer false negatives.
- Minimum Number of Reference Samples: The minimum number of reference samples to be selected by the algorithm.
- Maximum Number of Reference Samples: The maximum number of reference samples to be selected by the algorithm.
- Exclude reference samples with percent difference greater than: This option will filter reference samples with a percent difference above the specified value after a minimum of 10 samples have been selected.
- Add samples to reference set: This option adds the current project’s sample to the set of reference samples.
- Normalize sex chromosomes using only controls with matching sex: If this option is selected, non-autosomal chromosomes will only be normalized using reference samples whose sex matches that of the sample of interest.
- Controls average target mean depth below: Flags targets with average reference sample depth below the specified value.
- Controls variation coefficient above: Flags targets for which the variation coefficient is above the specified value. A high variation coefficient indicates that there is extreme variation in reference sample coverage for the target region.
- Targets Excluded From Normalization and CNV Calling: Here a region track can be selected that provides coordinates for regions that will be excluded from the normalization and CNV calling process. This is where you select the track created by the Filtered Targets Wizard.
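The two controls-based target flags in the list above can be illustrated with the standard definition of the coefficient of variation (standard deviation divided by mean). The flag labels and both threshold values in this sketch are hypothetical; only the general idea comes from the option descriptions.

```python
from statistics import mean, stdev

def target_flags(reference_depths, min_mean_depth=10.0, max_cv=0.3):
    """Flag one target based on the depth of the reference samples at that target.

    min_mean_depth and max_cv are illustrative thresholds, not VarSeq defaults.
    """
    ref_mean = mean(reference_depths)
    flags = []
    if ref_mean < min_mean_depth:
        flags.append("low controls depth")
    # Coefficient of variation: spread of reference coverage relative to its mean.
    cv = stdev(reference_depths) / ref_mean if ref_mean > 0 else float("inf")
    if cv > max_cv:
        flags.append("high controls variation")
    return flags

print(target_flags([100, 102, 98, 101]))  # []
print(target_flags([5, 6, 4, 5]))         # ['low controls depth']
print(target_flags([20, 80, 50, 110]))    # ['high controls variation']
```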
When designing your CNV calling pipeline, it is important to empirically determine the appropriate level of sensitivity and precision. We recommend benchmarking the algorithm across multiple levels of sensitivity by running the algorithm multiple times at different levels of precision (e.g. Very High Sensitivity, Balanced, and Very High Precision) and comparing the results to establish a best practice workflow. While the Balanced setting has been tuned to provide reasonable results when run on exome sequencing data, every lab has different requirements and performance varies across different lab preparation and sequencing methods.
Before running the CNV caller, we must first specify our set of filtered targets by clicking the Select Track button in the Targets Excluded From Normalization and CNV Calling section. Select the track computed using the Filtered Targets Wizard or through a method of your choosing. After updating this option, we can run the CNV caller by clicking OK.
When the algorithm runs, it will select a set of reference samples for each sample in the project. The reference set is chosen from the collection of samples in the reference folder that share the same target regions as the sample of interest. The algorithm selects those samples that are most similar to the sample of interest in terms of normalized coverage. Once the CNV caller finishes computing the results, a new table will be created labeled CNVs. This table contains the information associated with each CNV called by the algorithm.
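One way to picture the reference selection is as a ranking of candidate samples by their difference from the sample of interest, bounded by the minimum, maximum, and percent-difference options. The sketch below is an assumption-laden illustration: the tutorial does not define how percent difference is computed, so the formula here (total absolute difference relative to the sample's total normalized coverage) is a stand-in, and all names and default values are hypothetical.

```python
def percent_difference(sample, reference):
    """Assumed metric: total absolute coverage difference as a percentage."""
    total_diff = sum(abs(s - r) for s, r in zip(sample, reference))
    return 100 * total_diff / sum(sample)

def select_references(sample, candidates, min_refs=2, max_refs=3, max_pct_diff=20.0):
    """Pick the most similar reference samples.

    The closest min_refs candidates are always kept; beyond that, candidates
    above max_pct_diff are excluded, and at most max_refs are returned.
    """
    scored = sorted(
        (percent_difference(sample, cov), name) for name, cov in candidates.items()
    )
    selected = []
    for pct, name in scored:
        if len(selected) >= max_refs:
            break
        if len(selected) >= min_refs and pct > max_pct_diff:
            break
        selected.append((name, round(pct, 1)))
    return selected

# Hypothetical normalized coverage over four targets.
sample = [1.0, 1.0, 1.0, 1.0]
candidates = {
    "ref_a": [1.0, 1.1, 0.9, 1.0],
    "ref_b": [1.2, 0.8, 1.0, 1.0],
    "ref_c": [1.5, 0.5, 1.5, 0.5],
    "ref_d": [1.05, 1.0, 1.0, 0.95],
}
for name, pct in select_references(sample, candidates):
    print(name, pct)
```

In this toy data, ref_c differs far more from the sample than the others, so it is only chosen if the minimum number of references forces it in.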
Performing Sample QC
Before examining the CNVs called by the algorithm, we must first perform sample-level quality control. This is done by examining the sample table, which contains several useful metrics related to the CNV algorithm. To open the samples table in VarSeq, select Samples from the drop down directly above the left side of the current table.
You will notice that the CNV caller has populated the sample table with a number of fields under the heading “Copy Number Variants”.
The most useful field for sample QC is the “Sample Flags” field. This field will list one or more of the following flags if the sample fails any of our quality tests:
- High IQR: High interquartile range for Z-score and ratio. This flag indicates that there is high variance between targets for one or more of the evidence metrics.
- Low Sample Mean Depth: Sample mean depth below 30.
- Mismatch to reference samples: Match score indicates low similarity to control samples.
- Mismatch to non-autosomal reference samples: Match score indicates low similarity to non-autosomal control samples.
- Few Gender Matches: Not enough reference samples with matching gender to call X and Y CNVs.
The “Mismatch to reference samples” flag can often be resolved by rerunning the algorithm once more samples have been added to the reference set.
In addition to QC flags, the sample table also provides summary information about the number of CNVs called, the inferred gender of the sample, the reference samples chosen, and the percent difference between each sample and its associated references.
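Two of the sample flags described in this section lend themselves to a small sketch. The depth cutoff of 30 comes from the flag description above; the interquartile range cutoff of 2.0 and the choice to check only the Z-score are hypothetical, since the tutorial does not state how the High IQR flag is thresholded.

```python
from statistics import mean, quantiles

def sample_flags(target_depths, target_z_scores, min_mean_depth=30.0, max_z_iqr=2.0):
    """Return sample-level QC flags from per-target depths and Z-scores.

    max_z_iqr is an illustrative threshold, not a VarSeq value.
    """
    flags = []
    if mean(target_depths) < min_mean_depth:
        flags.append("Low Sample Mean Depth")
    # Interquartile range: distance between the 25th and 75th percentiles.
    q1, _, q3 = quantiles(target_z_scores, n=4)
    if q3 - q1 > max_z_iqr:
        flags.append("High IQR")
    return flags

print(sample_flags([25, 28, 22, 30], [-0.2, 0.1, 0.0, 0.3, -0.1]))
# ['Low Sample Mean Depth']
print(sample_flags([100, 120], [-5, -3, 0, 3, 5]))
# ['High IQR']
```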
Filtering Low Quality CNVs
When performing exome analysis, calling CNVs is only the beginning, as a typical exome will contain hundreds of CNVs, most of which will be either low-quality or clinically irrelevant. Thus, any clinical workflow for CNVs called from exome sequencing data must include a tertiary analysis process to identify a small set of high-quality clinically relevant CNVs.
This process involves the creation of a filter chain which will specify a set of rules used to exclude CNVs from our final analysis. To construct a filter chain from the CNV table, right click on the desired filter column and select Add to Filter Chain. We will start by filtering down to CNVs present in the current sample. To add this filter, right click on the CNV State column and select Add to Filter Chain.
This will show the Filter chain which now includes the newly created CNV State filter. Select the options, Deletion, Duplicate, and Het Deletion to filter out LoH events along with CNVs not present in the current sample.
Next, we will filter out low-quality CNVs that are likely to be false positives. We will start by filtering out flagged CNVs. To do this, right click on the Flags column and select Add to Filter Chain. Next, select the Missing option to exclude flagged CNVs from the final result. For our next quality filter, we will utilize the p-value field. As before, right click this field and select Add to Filter Chain. In the newly added p-value filter, enter a value of 0.01 and hit Enter. This will bring up the following filter options:
- Less than 0.01
- Equal to 0.01
- Greater than 0.01
Select the Less than 0.01 option. After adding these QC filters, the filter chain will look like the figure below.
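Conceptually, the three QC filters built so far behave like a single predicate applied to every row of the CNV table. The sketch below expresses that logic in plain Python on hypothetical records; the dictionary keys and example values are made up for illustration and do not correspond to a VarSeq export format.

```python
KEEP_STATES = {"Deletion", "Duplicate", "Het Deletion"}

def passes_qc(cnv, max_p_value=0.01):
    """Apply the CNV State, Flags, and p-value filters in sequence."""
    return (
        cnv["state"] in KEEP_STATES        # CNV present in the current sample
        and not cnv["flags"]               # Flags field is missing (unflagged)
        and cnv["p_value"] < max_p_value   # "Less than 0.01"
    )

# Hypothetical CNV rows.
cnvs = [
    {"region": "chrX:300-800", "state": "Duplicate", "flags": [], "p_value": 0.0004},
    {"region": "chr1:100-900", "state": "Deletion", "flags": ["Low Controls Depth"], "p_value": 0.001},
    {"region": "chr2:50-400", "state": "Het Deletion", "flags": [], "p_value": 0.2},
    {"region": "chr3:10-90", "state": "LoH", "flags": [], "p_value": 0.001},
]
kept = [c["region"] for c in cnvs if passes_qc(c)]
print(kept)
# ['chrX:300-800']
```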
ACMG CNV Classifier Algorithm
Now that we have filtered out low quality calls, we can begin the process of identifying clinically relevant CNVs. In order to do this, we must evaluate each CNV’s impact on its overlapping genes using the ACMG CNV Classifier. To run the ACMG classifier, select Add > Computed Data then, from the Table Type dropdown menu, select CNVs. Next, choose the ACMG Sample CNV Classifier algorithm and click OK.
Click OK on the next window to download the sources necessary to run the algorithm, leaving the default settings unchanged.
You will then be prompted to specify the options for running the algorithm dependencies. For now you can leave the default values and click OK to continue running the annotations.
Once the algorithm finishes computing results, you can scroll to the right to see the output fields. There are several fields worth mentioning when it comes to evaluating a CNV’s impact:
- Total Score: The total score of the CNV if scored in accordance with the ACMG guidelines.
- Classification: The CNV’s classification based on the total score.
- Criteria: The recommended ACMG guidelines criteria.
- Potential Gene Impact Scores: The score of the CNV for each overlapping gene, evaluated as if the gene has established evidence for dosage sensitivity.
Let’s start by adding a filter on the Classification field to exclude CNVs that are likely to be benign. In the newly added filter, select the options Benign and Likely Benign, then right click the filter card and select the Inverted option to exclude these CNVs.
Next, let’s add a filter on the Potential Gene Impact Scores field. In the newly added filter, select the Greater than 0 option to exclude all CNVs that have no impact on a protein coding gene.
PhoRank Gene Ranking
By using the ACMG Classifier, we have obtained a much more manageable set of CNVs, but we can further narrow down our list of filtered CNVs by taking into account the patient’s phenotypes. This can be done using our PhoRank phenotype ranking algorithm, which ranks each CNV and gene based on its relationship to the patient’s phenotypes. To run the PhoRank algorithm, select Add > Computed Data then, from the Table Type dropdown menu, select CNV. Next, choose the CNV PhoRank Gene Ranking algorithm and click OK, then OK again.
This will launch the PhoRank dialog, which will allow you to enter a comma separated list of HPO phenotype terms. Once you have entered the patient’s phenotypes, click OK to run the algorithm. For this tutorial we will be using the following phenotype list: Hypodontia, Keratitis, Lymphedema, Immunodeficiency, Cerebral cortical atrophy.
Once the algorithm finishes running, we can filter out CNVs that are not relevant to the patient’s phenotype by adding a filter on the Gene Rank field. Scroll all the way to the right in the CNV table to find Gene Rank. This field assigns a value between 0 and 1 to each gene based on its relationship to the specified phenotypes, with higher values indicating a stronger relationship. For this tutorial, we will filter down to CNVs overlapping a gene with a score exceeding 0.8.
This filtering strategy results in a small set of 6 clinically relevant CNVs. If you scroll to the middle of the CNV table and examine the ACMG Classification for these variants, you will notice that a deletion in the gene IKBKG has been classified as Likely Pathogenic.
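The clinical-relevance filters from the last two sections (the inverted Classification filter, Potential Gene Impact Scores greater than 0, and Gene Rank above 0.8) can be summarized as one predicate. As before, this is a sketch on hypothetical records: the keys, names, and values are invented, and a CNV is assumed to carry one impact score and one rank per overlapping gene.

```python
EXCLUDED_CLASSES = {"Benign", "Likely Benign"}

def is_candidate(cnv, min_gene_rank=0.8):
    """Keep CNVs that are not (likely) benign, affect a gene, and match the phenotypes."""
    return (
        cnv["classification"] not in EXCLUDED_CLASSES          # inverted filter
        and max(cnv["gene_impact_scores"], default=0) > 0      # impacts some gene
        and max(cnv["gene_ranks"], default=0.0) > min_gene_rank
    )

# Hypothetical CNV rows.
cnvs = [
    {"name": "del_A", "classification": "Likely Pathogenic", "gene_impact_scores": [0.9], "gene_ranks": [0.93]},
    {"name": "dup_B", "classification": "Benign", "gene_impact_scores": [0.45], "gene_ranks": [0.95]},
    {"name": "del_C", "classification": "VUS", "gene_impact_scores": [0.0], "gene_ranks": [0.85]},
    {"name": "dup_D", "classification": "VUS", "gene_impact_scores": [0.45], "gene_ranks": [0.40, 0.10]},
]
print([c["name"] for c in cnvs if is_candidate(c)])
# ['del_A']
```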
In the next section we will use VSClinical to interpret this deletion in accordance with the ACMG Guidelines.
ACMG CNV Guidelines
To open the ACMG Guidelines, open a new tab by clicking on the + icon and select VSClinical. Then select ACMG Guidelines from the drop-down menu.
This will open a dialog which will ask you to specify which assessment catalogs to use. New catalogs can be created using the Create buttons to the right or you can choose to create all missing catalogs at once using the Create Missing Catalogs button at the bottom of the interface. These assessment catalogs will be used to save variant, CNV, and gene level interpretations.
If you choose to create the catalogs using the Create buttons on the right, the create catalog window will first ask you to determine the database type. The options are SQLite, PostgreSQL, MySQL, or if you have the added feature, VSWarehouse.
If not using VSWarehouse, a common choice is using the SQLite option to save the catalog locally. Once you have specified the catalog name, location, and database type, click OK.
The Create Missing Catalogs button at the bottom of the screen will automatically create SQLite catalogs, name them, and save them to your assessment catalogs folder. Once the assessment catalogs are selected or created, click OK.
The next dialog will prompt you to download required and recommended annotations which will be used in the evaluation of the CNV. In this dialog, you can also lock the versions of the annotations that are being used so that even when a new version of the annotation is available, VSClinical will use the locked version for evaluations. Once the required annotations are downloaded, click Close.
The final step before creating an evaluation is to create record sets to track both CNVs and variants for reporting. Click the Create Default CNV icon for the Primary Findings, Secondary Findings, and Uncertain Significance sections, then click Apply.
Now that we have created our catalogs and record sets, we can finally begin our evaluation by clicking the Start New Evaluation button.
This will create a new evaluation and open the Evaluation tab. In the Evaluation tab, variants and CNVs can be added to the current evaluation either manually or from the current project. Across the top are the Genes, Variants, CNVs, Phenotypes, and Report tabs which can be used to navigate to these sections at any time.
To add CNVs to the evaluation scroll down to the CNVs to Evaluate section and click the Add CNVs From Project button.
Notice that the IKBKG exon 4-5 deletion is listed as a result of the filtered CNVs from the project. To add this CNV to the staging list, select the checkbox next to the CNV and click Prepare to Add. Once the CNV appears in the staging list to the right, click Add 1 CNVs to add the deletion to the evaluation.
Next, click on the CNV listed in the CNVs to Evaluate section to analyze this CNV in the CNVs tab. Alternatively, you can click on the CNVs tab at the top to open the CNVs section.
At the top of the CNV Evaluation tab, there is a summary and description of the IKBKG exon 4-5 deletion. Scroll down to the CNV Interpretation for Sample section, which displays the CNV level interpretation information.
Next, let’s fill in some of this information. First, change the Reporting As dropdown from Don’t Report to Primary Findings. Next, we will add the CNV Summary information from the scoring section on the right to the Interpretation box on the left. To do this, click on More.. underneath the CNV Summary to expand the paragraph. Then hover over the small blue plus sign and choose Add Text to My Interpretation.
To save this interpretation click Review & Save Now then click Save and Close in the next dialog.
Next, we will examine the gene level information and score our CNV. Our interpretation for this CNV in the context of the IKBKG gene begins in the Genomic Region section. Graphically, we can see the genomic region and the information displayed from a wide range of CNV annotation sources. The first graph contains information from the ClinGen Dosage Sensitivity track. The red bar in the graph indicates that the IKBKG gene is a known haploinsufficient gene. We also notice that deletions in this region are not common in gnomAD or 1KG. You can click on any of the graphs to get more information from these annotations.
At the bottom of this plot are the scoring criteria for assessing whether the CNV overlaps any commonly deleted regions and whether the CNV overlaps a large number of genes. For this CNV, criterion 3A is scored because the CNV partially overlaps the IKBKG gene, so only one gene is affected by the deletion. The reasoning for the auto-recommended criteria is displayed at the bottom of this card.
To continue the CNV interpretation and scoring on a gene level scroll down to the Overlapping Genes section. This section displays the scoring information for all 5 sections. Criterion 1A is applied as this CNV overlaps a protein-coding gene, and criterion 3A is applied as it only overlaps a single gene. Notice that section 2 is scored as 2E. Let’s take a look at the reason for this recommendation by clicking on the Edit button next to the section 2 scoring.
The Section 2 scoring wizard is set up in a decision-tree structure to guide the user through the assessment of the CNV’s impact on the gene. The 2E score indicates that this deletion is a frameshift and is predicted to result in nonsense mediated decay. Displayed at the bottom of the scoring wizard are the previously answered questions that led to the final 2E classification. You can explore the other decision points in the scoring wizard by clicking on Partially Overlaps HI and going through the guided prompts.
Click on the small x at the top of the wizard to return to the CNV scoring summary.
Notice that the scoring for Section 4 has been “Skipped for HI gene”. This is because IKBKG is a gene with well-established evidence for haploinsufficiency according to ClinGen, so Section 4 is not needed to establish dosage sensitivity.
The last section to score is Section 5. This section does not have any auto-recommended scoring criteria, as it is sample specific and must be scored manually. Click on the Start button for Section 5.
This opens up the Section 5 scoring wizard. The first question is “Is the inheritance status of the CNV known?” In this case the CNV inheritance is unknown, so click on the second option, Inheritance Unknown.
The next section asks whether the phenotype is consistent with what is expected for the gene. For this question, we will select the first option, as the patient’s phenotypes are consistent with what has been reported in other patients with LoF mutations in IKBKG.
For the final question we are asked whether the patient’s phenotype is highly specific. For this question we will answer that the phenotype is non-specific and apply the criterion 5G, resulting in an additional 0.10 points.
Exit the Section 5 wizard to see the scoring summary and classification for the CNV. The additional 0.10 points from Section 5, when combined with the criterion from Section 2, have resulted in a classification of Pathogenic.
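The arithmetic behind this change in classification can be sketched as follows. The classification thresholds are those of the ACMG/ClinGen CNV scoring framework (Pathogenic at 0.99 or above, Likely Pathogenic from 0.90 to 0.98). The per-criterion points for 1A, 2E, and 3A are assumptions: the tutorial only states the 0.10 points for 5G, and 0.90 for 2E is chosen here because it is consistent with the Likely Pathogenic classification seen before Section 5 was scored.

```python
def classify(total_hundredths):
    """Map a total score, in hundredths of a point, to a classification."""
    if total_hundredths >= 99:
        return "Pathogenic"
    if total_hundredths >= 90:
        return "Likely Pathogenic"
    if total_hundredths > -90:
        return "Uncertain Significance"
    if total_hundredths > -99:
        return "Likely Benign"
    return "Benign"

# Points are stored as integer hundredths to avoid floating-point
# rounding right at the classification thresholds.
# 1A, 2E, and 3A values are assumptions; 5G (0.10) is from the tutorial.
section_scores = {"1A": 0, "2E": 90, "3A": 0}

before = classify(sum(section_scores.values()))
section_scores["5G"] = 10
after = classify(sum(section_scores.values()))

print(before, "->", after)
# Likely Pathogenic -> Pathogenic
```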
Next we will fill in the Notes on Relevance to Patient and Notes on the Gene sections. Hover over the blue + icon under Auto Interpretation of Gene Impact and select Append to Notes on Relevance to Patient. Then under Gene Summary, hover over the blue + icon and select Add to Notes on Gene.
Now we are finally ready to generate a clinical report including the IKBKG exon 4-5 deletion.
Using CNV Interpretations in Reports
Start by clicking on the Report tab. Once in the Report tab, click on the gray Microsoft Word symbol on the right.
To create a Word template, click the New Report Template button.
There are three report templates to choose from, but for this tutorial we will be using the Mendelian Disorder Template. For the New Template Name, type CNV Exome Tutorial, then click Create.
Next click Render to create the report.
As you scroll through the report, you will notice that patient level information is shown at the top of the report, while variant summary information is shown beneath. Scrolling to the Primary Findings section, you will see that all interpretation information associated with the IKBKG exon 4-5 deletion has been automatically filled in.
This tutorial was designed to provide a demonstration of VarSeq’s CNV calling and interpretation capabilities in the context of whole exome sequencing panels.
If you are interested in getting a demo license to try out this and other features please request a demo from: Discover VarSeq
Additional features and capabilities are being added all the time, so if you do not see a feature you need for your workflows please do not hesitate to let us know!