Welcome to the SVS Variant Filtering on Unrelated Samples Tutorial!
Updated: March 19, 2019
Version: 8.6.0 or higher
This tutorial covers a comprehensive workflow for filtering variants based on public annotation data, classifying variants by position relative to a gene transcript, classifying variants by their effect on the amino acid sequence, and visualizing variants of interest.
- To complete this tutorial you will need to download and unzip the following file, which includes a starter project that contains data from public 1000 Genomes Phase3 sample sequence data, as well as simulated phenotype information.
- You will also need to download and install the add-on script Activate Variants by Genotype Count Threshold, which you can find on our scripts repository web page, along with instructions for installing it in the appropriate directory for use in SVS.
Download Necessary Data Sources
The Golden Helix data repository features several different annotation sources (“tracks”) of varying types and builds. The sources can be accessed through the SVS Data Source Manager and are hosted on our data server (http://data.goldenhelix.com/).
Before you begin this tutorial, you will need to download a few of these annotation data sources so they may be used in the filtering steps.
If you have completed the SVS Introduction to NGS (DNA-Seq) Tutorial, you should have already downloaded all of the tracks mentioned below, and you may skip to Section 1. Overview.
Steps for Downloading
The steps below will guide you through downloading the necessary tracks.
- Open the previously downloaded project.
- Choose Tools > Manage Data Sources to open the Data Source Library.
The tracks that are available locally (have already been downloaded) are listed in the Local directory. You can choose to view or sort your local tracks by name, type, species or build.
- Click Public Annotations to bring up the list of available network sources.
The list will be filtered by the Default Genome Assembly selected for the project; in this case, the filter will list only sources with the GRCh37 build.
- If you click on the drop down arrow next to Homo sapiens (Human), GRCh37 (hg19) (Feb 2009), you can view all other assemblies that are available from our data servers.
Next, select the sources you will need for this tutorial.
- gnomAD Exomes Variant Frequencies 2.0.1 v2, BROAD
- NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI
- Reference Sequence GRCh37 g1k, 1000Genomes
- Click Download to download the selected tracks to your local machine.
This may take a while. After the sources finish downloading, close the Data Source Library and Download Window to proceed with the tutorial.
For this tutorial, you will also need the annotation track RefSeq Genes 105 Interim v1, NCBI. However, this track is available locally by default, so you should not need to download it.
1. Overview
The advent of high-throughput sequencing (often called Next Generation Sequencing) allows for affordable assaying of not just common and pre-defined lists of genetic markers, but every variant covered by the sequencing target. See A Hitchhiker’s Guide to Next Generation Sequencing – Part 3 for a discussion of the motivation for this technology in different experiment types and an overview of the upstream bioinformatic steps to get to a variant list.
For whole genome sequencing, this can easily result in 3-4 million variants for an individual human. For exome sequencing, one hundred thousand variants are commonly identified in and around gene regions. But because many variants are private to individual samples, when analyzing a group of samples you may have total variant numbers much larger than these.
In essence, when using microarrays to genotype samples, a filtering process was already used to select the common variants that went into the probe design of a microarray platform. With NGS, we start with the full set of rare and common variants, and the user can choose the filters and variant subsets that are appropriate for the downstream analysis.
In this tutorial, we will focus on the common workflow of filtering down to rare, non-synonymous coding variants that are potential causal candidates for a disease (simulated). The filtering steps demonstrated in the tutorial do not take family structure into account and are therefore appropriate for use on unrelated samples.
2. Rare Variant Filtering
New in SVS is the ability to annotate and/or filter variants by multiple sources in one step. Variants causing rare diseases are not expected to be found in most people, so we will use the following steps to filter to a set of high quality rare variants.
First, the variants outside of exon regions will be removed. Then common variants will be removed based on both the gnomAD data source and the NHLBI Exome Sequencing Project data source.
Filter using public data sources
- Open the Pheno + 1kG Phase3 chr15-chr16 – Sheet 1 spreadsheet and choose DNA-Seq > Annotate and Filter Variants.
NOTE: This tool will automatically use the Reference Sequence GRCh37 g1k, 1000Genomes track, whether you have a local copy of it or not. If you have not already done so, you should download this track to your local directory.
- Click Add Track(s) and check mark the following sources:
- RefSeq Genes 105 Interim v1, NCBI
- gnomAD Exomes Variant Frequencies 2.0.1 v2, BROAD
- NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI
- Click Select.
- The dialog should look like Figure 2-1.
NOTE: The filters are applied in the order the annotation sources are listed. You can drag and drop the sources in this dialog to change the order.
- Click Next > to access the filter options dialog for each source selected.
- In the RefSeq Genes 105 Interim v1, NCBI filter dialog, select the plus sign next to the Configured Filters: options and choose Gene Region (Combined) from the Filter on: dropdown box, then select exon. See Figure 2-2. Click Next >.
- In the gnomAD Exomes Variant Frequencies 2.0.1 v2, BROAD filter dialog, select the plus sign next to the Configured Filters: options and choose Alt Allele Freq (AF) from the Filter on: dropdown box, then enter 0.01 for the upper bound. See Figure 2-3. Click Next >.
- In the NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI filter dialog, select the plus sign next to the Configured Filters: options and choose All MAF from the Filter on: dropdown box, then enter 0.01 for the upper bound. See Figure 2-4. Click Next >, then click Finish to start the filter process.
After filtering is complete, the results message shown in Figure 2-5 will appear with a summary from each step of the filter.
An Annotation Results spreadsheet is created for each selected track, along with a filter results spreadsheet and a subset spreadsheet based on this filtering.
- Open the Pheno + 1kG Phase3 chr15-chr16 – Sheet 1 – Filtered Subset.
- Rename this spreadsheet Rare Variants.
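The filter chain configured in this section can be sketched outside of SVS as a sequence of predicates applied in order, with a count reported after each step. This is only an illustration, not how SVS implements the tool: the record fields, the sample values, the inclusive upper bound, and the choice to treat variants missing from a frequency track as rare are all assumptions.

```python
from typing import Optional

# Hypothetical variant records; field names and values are illustrative only.
variants = [
    {"id": "v1", "region": "exon",   "gnomad_af": 0.002, "esp_maf": 0.004},
    {"id": "v2", "region": "intron", "gnomad_af": 0.001, "esp_maf": 0.001},
    {"id": "v3", "region": "exon",   "gnomad_af": 0.150, "esp_maf": 0.120},
    {"id": "v4", "region": "exon",   "gnomad_af": None,  "esp_maf": None},
    {"id": "v5", "region": "exon",   "gnomad_af": 0.005, "esp_maf": 0.030},
]

MAX_FREQ = 0.01  # upper bound entered in both frequency filter dialogs


def is_rare(freq: Optional[float]) -> bool:
    # Assumption: a variant absent from a frequency track counts as rare,
    # and the upper bound is inclusive.
    return freq is None or freq <= MAX_FREQ


# Filters are applied in the order listed, mirroring the track order in the dialog.
filters = [
    ("RefSeq exon", lambda v: v["region"] == "exon"),
    ("gnomAD AF", lambda v: is_rare(v["gnomad_af"])),
    ("ESP All MAF", lambda v: is_rare(v["esp_maf"])),
]


def run_filters(records, steps):
    """Apply each filter in turn and record how many variants remain."""
    summary = []
    for name, keep in steps:
        records = [r for r in records if keep(r)]
        summary.append((name, len(records)))
    return records, summary


rare_variants, summary = run_filters(variants, filters)
for name, remaining in summary:
    print(f"{name}: {remaining} variants remain")
```

Because each step only removes variants, reordering the filters changes the per-step counts in the summary but not the final subset.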
3. Annotate Variant Effect on Transcripts
Next, the results from the RefSeq Genes annotation will be used to filter out synonymous variants and other variants that have low or unknown functional effects.
- Open the RefSeq Genes 105 Interim v1, NCBI – Variant Report.
The Variant Report output (see Figure 3-1) includes a summary of the computed interactions between each variant and the overlapping transcripts. In the case of multiple interactions, the interaction with the highest priority will be listed. (See Annotate Variant Effect on Transcripts for further details.)
- Right-click on the Effect (Combined) column and choose Value Counts. The report shown in Figure 3-2 will come up.
Note the number of variants that were classified as Other and “?” (invalid). These variants do not have a functional effect on the transcript and therefore should be dropped from further analysis.
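The "highest priority" rule and the Value Counts report described above can be sketched as follows. The category names come from this tutorial, but the priority order shown here is an assumption made for illustration, and the per-transcript data is hypothetical. Consult the SVS manual for the ranking SVS actually uses.

```python
from collections import Counter

# Assumed priority order, highest first; the ranking used by SVS may differ.
PRIORITY = ["LoF", "Missense", "Other", "?"]


def combined_effect(effects):
    """Return the highest-priority effect among a variant's transcript effects."""
    return min(effects, key=PRIORITY.index)


# Hypothetical per-transcript effects for four variants.
transcript_effects = {
    "v1": ["Other", "Missense"],
    "v2": ["LoF", "Missense", "Other"],
    "v3": ["Other"],
    "v4": ["?"],
}

combined = {vid: combined_effect(effs) for vid, effs in transcript_effects.items()}

# Equivalent in spirit to Value Counts on the Effect (Combined) column.
counts = Counter(combined.values())
for category, n in counts.items():
    print(f"{category}: {n}")
```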
- Once again right-click on the Effect (Combined) column. This time, select Activate by Category.
- Choose LoF and Missense categories for activation (see Figure 3-3) and click OK.
- Now apply this selection to our Rare Variant subset. To do this, go to Select > Apply Current Selection to Second Spreadsheet. Choose to apply filtered rows to the Rare Variants spreadsheet. See Figure 3-4. Click OK.
The Rare Variants spreadsheet with the filtered-out columns deactivated will pop forward.
- Reactivate the phenotype column by clicking once on the C/C column header. Then create a subset spreadsheet by going to Select > Subset Active Data.
- Rename the subset to Rare Functional Variants.
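Conceptually, the steps in this section select variants by category in one table and then carry that selection over to a second table that shares the same variant identifiers. A minimal sketch, assuming hypothetical variant identifiers and a simplified column layout:

```python
# Hypothetical Variant Report rows: variant id -> Effect (Combined).
variant_report = {
    "v1": "Missense",
    "v2": "LoF",
    "v3": "Other",
    "v4": "?",
    "v5": "Missense",
}

# Hypothetical column headers of the Rare Variants spreadsheet:
# the phenotype column followed by the variants that survived filtering.
rare_variant_columns = ["C/C", "v1", "v3", "v5"]

# Categories chosen in the Activate by Category dialog.
ACTIVE_CATEGORIES = {"LoF", "Missense"}

# Step 1: activate report rows whose combined effect is in a chosen category.
active_ids = {vid for vid, effect in variant_report.items() if effect in ACTIVE_CATEGORIES}

# Step 2: apply the selection to the second spreadsheet, then reactivate
# the phenotype column so it is kept in the subset.
subset_columns = [
    col for col in rare_variant_columns if col == "C/C" or col in active_ids
]
print(subset_columns)
```

Here v2 is active in the report but is not a column of the second spreadsheet, so it simply does not appear in the subset.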
4. Find Variants Unique to Affected Samples
In rare variant analysis, it is often useful to identify variants that are mostly unique to cases and not found frequently in controls. In SVS, it is possible to identify these variants using the Activate Variants by Genotype Count Threshold script that you downloaded at the beginning of the tutorial.
NOTE: The case/control status in this dataset is simulated for demonstration purposes.
- Open the Rare Functional Variants spreadsheet.
- Click once on the C/C column header to set the column as the dependent variable. The text in the column should turn magenta indicating the dependent status.
- Choose Select > Activate Variants by Genotype Count Threshold. It should be at the bottom of the Select menu in italics. See Figure 4-1.
- For the C/C=False (Controls) category select to Activate columns that have <0.10 percent of the following genotypes Alt_Ref and Alt_Alt.
- For the C/C=True (Cases) category select to Activate columns that have > 100 occurrences of the following genotypes Alt_Ref and Alt_Alt.
- The dialog should look like Figure 4-2. Click OK.
Eighteen columns should remain active: 17 variant columns and the 1 phenotype column. The variant columns should be investigated further for significance at an individual variant level.
- Create a column subset by going to Select > Subset Active Data and rename the subset Interesting Variants.
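The logic of the genotype count thresholds can be sketched as below. This is not the add-on script itself. The toy data, the `Ref_Ref` label, the function names, and the scaled-down thresholds are assumptions chosen so the example is small enough to check by hand; the tutorial uses more than 100 occurrences in cases and less than 0.10 percent in controls.

```python
# Genotype labels counted as carrying the alternate allele (from the dialog).
NON_REF = {"Alt_Ref", "Alt_Alt"}


def count_non_ref(genotypes):
    """Count genotypes that carry at least one alternate allele."""
    return sum(1 for g in genotypes if g in NON_REF)


def passes_thresholds(case_gts, control_gts, min_case_count, max_control_percent):
    """True if cases exceed the count bound and controls stay under the percent bound."""
    case_count = count_non_ref(case_gts)
    control_percent = (
        100.0 * count_non_ref(control_gts) / len(control_gts) if control_gts else 0.0
    )
    return case_count > min_case_count and control_percent < max_control_percent


# Hypothetical toy data: 5 cases and 10 controls per variant.
# "Ref_Ref" is an illustrative label for a homozygous reference genotype.
genotypes = {
    "variantA": {"cases": ["Alt_Ref", "Alt_Ref", "Alt_Alt", "Ref_Ref", "Ref_Ref"],
                 "controls": ["Ref_Ref"] * 10},
    "variantB": {"cases": ["Alt_Ref", "Alt_Alt", "Alt_Alt", "Ref_Ref", "Ref_Ref"],
                 "controls": ["Alt_Ref", "Alt_Ref"] + ["Ref_Ref"] * 8},
    "variantC": {"cases": ["Alt_Ref", "Ref_Ref", "Ref_Ref", "Ref_Ref", "Ref_Ref"],
                 "controls": ["Ref_Ref"] * 10},
}

# Toy thresholds scaled to the toy data, not the values used in the tutorial.
active = [
    name for name, g in genotypes.items()
    if passes_thresholds(g["cases"], g["controls"],
                         min_case_count=2, max_control_percent=15.0)
]
print(active)
```

A variant stays active only when both conditions hold. In this toy data, variantB is common in cases but fails the control bound, and variantC is absent from controls but too rare in cases.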
5. Visualization of the Data
- Open Pheno + 1kG Phase3 chr15-chr16 – Sheet 1.
- Then go to GenomeBrowse > Variant Map.
A variant map will be drawn showing all non-reference genotypes from the spreadsheet. Since the data for this project only contains variants for chromosomes 15 and 16, we will zoom into that location on the map by typing chr15-chr16 in the location bar at the top of the screen. Due to the size of the data, it can take a few minutes for the full coverage computation to finish. Once that is done, you should see a density plot of the available variants in this region. See Figure 5-1.
- In the Plot Tree, click on the graph item Variants.
- Then in the Controls dialog, click on the Group by tab and select the C/C field from the drop down box.
The rows of the variant map will be reordered and grouped according to this variable, with color codes along the y-axis to indicate each group. In the default coloring, the control samples will be colored blue (false) and the cases will be green (true).
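Grouping the rows by C/C amounts to a stable sort of the samples on the phenotype value plus a color lookup per group. In this small sketch the sample names are hypothetical, and placing controls before cases is an assumption about the display order; the colors are the defaults described above.

```python
# Hypothetical samples with their simulated case/control status (True = case).
samples = [("S1", True), ("S2", False), ("S3", True), ("S4", False)]

# Default group colors described in the tutorial.
GROUP_COLORS = {False: "blue", True: "green"}

# A stable sort keeps the original sample order within each group.
# Assumption: controls (False) are drawn before cases (True).
grouped = sorted(samples, key=lambda s: s[1])

for name, is_case in grouped:
    print(name, "case" if is_case else "control", GROUP_COLORS[is_case])
```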
- Now zoom into a gene of interest by typing the name MRPS11 into the location bar at the top of the window, then clicking on one of the four MRPS11 (Gene Name) 15: 89,010,684 – 89,023,998 entries in the selection menu that drops down from the location bar. See Figure 5-2.
The resulting display is shown in Figure 5-3.
- Now select Plot to create an additional plot. When the Add Data Sources dialog comes up, click Project. Select the Interesting Variants spreadsheet, then check Variants and click Plot & Close. Group the new plot by C/C status the same way as before.
We see that one of the variants from the Interesting Variants spreadsheet is present in this gene. See Figure 5-4.
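Checking whether a variant lies in a gene is an interval test on chromosome and position. The sketch below uses the MRPS11 coordinates shown in the location bar entry. The variant positions are hypothetical, and treating both interval ends as inclusive is an assumption.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    def contains(self, chrom: str, pos: int) -> bool:
        # Assumption: both ends of the interval are inclusive.
        return chrom == self.chrom and self.start <= pos <= self.end


# Coordinates from the MRPS11 (Gene Name) entry in the location bar.
MRPS11 = Region("15", 89_010_684, 89_023_998)

# Hypothetical (chromosome, position) pairs, not taken from the project data.
interesting = [("15", 89_015_000), ("15", 90_500_000), ("16", 89_015_000)]

in_gene = [v for v in interesting if MRPS11.contains(*v)]
print(in_gene)
```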
- Click the back-arrow (the left-facing arrow to the left of the location bar) once to get back to the entire Chromosome 15 – Chromosome 16 display.
- Now open the Feature List view of the upper plot: hover the mouse over the upper plot and click the spreadsheet icon that appears in its upper left corner. Alternatively, if you have not yet created a feature list anywhere, highlight the first Variants entry in the plot tree and then go to View > Dock Window > Feature List.
You can scroll through the list of identified variants from the Interesting Variants spreadsheet by clicking on a row in the Feature List and using the arrow keys on the keyboard. See Figure 5-5.