Welcome to the SVS Introduction to Next-Generation Sequencing (NGS) Tutorial!
Updated: January 28, 2021
Version: 8.6.0 or higher
This tutorial covers the introductory steps and procedures that will prepare your dataset for further analysis and filtering on NGS (Next-Generation Sequencing) data. The steps include data import, annotation track download and filtering, and variant classification.
To complete this tutorial you will need to download and unzip the following VCF files.
This tutorial covers the introductory steps that prepare a dataset created by a Next-Generation Sequencing pipeline for further NGS analysis or filtering. The topics covered include data import (both VCF files and Complete Genomics files are discussed), annotation source download and filtering, and variant classification.
This tutorial is meant to provide information about different import and filtering techniques. Since there are a number of different workflows that may be suitable depending on the study design and analytic needs of the researcher, many of the techniques are only discussed and not performed.
The VCF files used in this tutorial were processed by Golden Helix using a secondary analysis DNA-seq pipeline provided by Seven Bridges Genomics. The BAM files were downloaded from the 1000 Genomes ftp site and used as input to the pipeline, which then created the VCF files.
In addition to VCF files, SVS can also import Complete Genomics Var files. The Var files are expected to be in tab-separated format (.tsv) or a compressed version (.tsv.gz, .tsv.bz2). Several files can be imported simultaneously into SVS using a tool found in the import menu, Import > Import Complete Genomics Var Files.
NOTE: MasterVar files are not supported by SVS. Complete Genomics provides command-line tools through their cgatools software that will convert these files to VCF.
Golden Helix hosts several annotation sources on our data server that are available for SVS customers to download. A few of the more notable tracks are discussed in this tutorial. During the filtering process, the annotation sources can also be accessed directly from the server without having to download local copies, although this can make the process significantly slower.
Finally, Variant Classification is used to classify the coding variants and annotate the protein changes. This step can be done at several different points in the analysis. For example, if you were to continue filtering, you could postpone this step until the end of the workflow and only classify potential candidates.
2. Import Variants and Quality Fields
First create a new project.
- Open SVS and log in, then click Create New Project.
- After Project name: type Intro to NGS and choose an appropriate directory in which to store the project.
NOTE: The Default Genome Assembly should be set to Homo sapiens (Human), GRCh37 (hg19) (Feb 2009). This is the appropriate build for this tutorial. If you are following this procedure with your data, and it aligns to a different build, click Change… to select the appropriate Genome Assembly.
- Click OK.
Your view is now of the Project Navigator. Next, import the previously downloaded VCF files.
- Select Import > Import VCFs and Variant Files.
- Click Add Files and navigate to the previously downloaded files. Select all the files and click Open. The dialog should look like Figure 2-1.
- Click Next > and the files will be compressed and indexed to facilitate import. The dialog will look like Figure 2-2 while this process is taking place.
NOTE: The compressing and indexing phase could take a couple of minutes, depending on your machine’s hardware.
- Select the Family Samples import relationship (Figure 2-3), then click Next >.
- Now indicate the pedigree and affection status for each sample in accord with Figure 2-4. For SVS, True means affected and False means unaffected. Then click Next >.
In the next dialog, there are several import options based on the contents of the VCF files. When importing your own dataset, you can choose to select as many or as few of these options as you would like.
- In this dialog, check the following FORMAT (Sample) fields:
- Allelic Depths (AD)
- Read Depths (DP)
- Genotype Qualities (GQ)
- Genotypes (G_T)
- The dialog should now look like Figure 2-5. Click Next >.
NOTE: Depending on the Variant Caller used to produce the VCF files, the Allelic Depths (AD) field may not be provided. It is a common field for VCFs produced by GATK but SAMtools does not provide it. If the field is not available then skip over the sections of this tutorial that refer to it.
- On the last dialog enter the Sheet Base Name as 1kG CEU Trio and leave the rest of the options as defaults. The dialog should look like Figure 2-6. Then click Finished.
NOTE: The import phase should take a couple of minutes, depending on your machine’s hardware. The resulting spreadsheets will have ~2.3 million columns. VCF files created from Complete Genomics may be much larger and thus the import time may increase considerably.
Four spreadsheets are created after the import finishes:
- 1kG CEU Trio – Genotypes (G_T) – Sheet 1,
- 1kG CEU Trio – Allelic Depths (AD) – Sheet 1,
- 1kG CEU Trio – Read Depths (DP) – Sheet 1, and
- 1kG CEU Trio – Genotype Qualities (GQ) – Sheet 1.
The first spreadsheet, 1kG CEU Trio – Genotypes (G_T) – Sheet 1, contains the variant calls and will be used for most of the analysis and filtering procedures. The other three spreadsheets contain depth and quality information about the calls and serve as filter criteria for eliminating low-quality variants.
See Import VCFs and Variant Files for full details on all the available import options.
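To make the four imported FORMAT fields more concrete, the following minimal Python sketch shows how a single VCF sample column pairs up with its FORMAT keys. It is an illustration only and not how SVS performs the import; the function names and the example sample column are hypothetical.

```python
from typing import Dict, List, Optional


def parse_sample_fields(format_col: str, sample_col: str) -> Dict[str, str]:
    """Pair each FORMAT key (e.g. GT, AD, DP, GQ) with the sample's value."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    # Trailing fields may be dropped in a VCF sample column; pad them with ".".
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


def parse_allelic_depths(ad: str) -> Optional[List[int]]:
    """Turn an AD string such as "12,9" into [12, 9]; None if any part is missing."""
    parts = ad.split(",")
    if any(p in (".", "") for p in parts):
        return None
    return [int(p) for p in parts]


# Hypothetical sample column for a heterozygous call.
fields = parse_sample_fields("GT:AD:DP:GQ", "0/1:12,9:21:99")
depths = parse_allelic_depths(fields["AD"])
print(fields["GT"], depths, fields["DP"], fields["GQ"])
```

Each of the four spreadsheets created by the import corresponds to one of these keys, with one value per sample and variant.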
3. Quality Filters
Generate Alternate Read Ratio
The Allelic Depths (AD) spreadsheet created upon import contains comma-separated lists of the read depths corresponding to all observed alleles (reference depth, alternate1 depth, etc.)
This spreadsheet can be used to create a useful metric called Alternate Read Ratio or Alt Read Ratio. This value is defined as the proportion of reads that map to the alternate allele, on a scale from 0 to 1. With this in mind, you would expect homozygous alternate calls to have corresponding Alt Read Ratio values close to 1, heterozygous calls to have ratios close to 0.5 and homozygous reference calls to have ratios close to 0.
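As a rough illustration of the metric, the sketch below computes an alt read ratio from one allelic depths entry. It assumes that reads for all alternate alleles are summed at multi-allelic sites and that the ratio is undefined when no reads are present; SVS may handle these cases differently, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def alt_read_ratio(allelic_depths: Sequence[int]) -> Optional[float]:
    """Proportion of reads that support an alternate allele.

    allelic_depths is ordered as (reference depth, alternate1 depth, ...).
    Returns None when no reads were observed (assumption: ratio undefined).
    """
    total = sum(allelic_depths)
    if total == 0:
        return None
    # Assumption: all alternate alleles are pooled into one alternate count.
    alt = sum(allelic_depths[1:])
    return alt / total


# Hypothetical depths for the three call types.
for label, depths in [("hom ref", [20, 0]), ("het", [11, 9]), ("hom alt", [0, 18])]:
    print(label, alt_read_ratio(depths))
```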
- Open 1kG CEU Trio – Allelic Depths (AD) – Sheet 1 and choose DNA-Seq > Calculate Alt Read Ratio.
The resulting spreadsheet (Alt Read Ratio) will be used to filter against in the next step.
- Close all open spreadsheets by selecting Window > Close All from the Project Navigator.
Remove Low Quality Variants
- Open 1kG CEU Trio – Genotypes (G_T) – Sheet 1 and select DNA-Seq > Set Genotypes to No-Call based on Additional Spreadsheets.
- Click Add Spreadsheet(s) and highlight 1kG CEU Trio – Read Depths (DP) – Sheet 1, 1kG CEU Trio – Genotype Qualities (GQ) – Sheet 1 and Alt Read Ratio. Click OK. Then click Next>.
NOTE: The threshold values used in this tutorial were determined based on the empirical distributions. In your own study, the thresholds may depend on other factors as well as personal preference.
- In the tab corresponding to Read Depths (DP), enter 10 as the threshold value (Figure 3-1).
- In the tab corresponding to Genotype Qualities (GQ), enter 15 as the threshold value (Figure 3-2).
- In the tab corresponding to Alt Read Ratio, check Zygosity Based Filtering Options. This will allow you to set different thresholds for different types of calls.
- After Convert Ref_Ref to ‘?_?’ if, change the direction to >= and enter 0.15 as the threshold value.
- After Convert Ref_Alt to ‘?_?’ if, select Filter by Range and change inside of to outside of in the drop down menu and enter 0.3 for the lower bound and enter 0.7 for the upper bound.
- After Convert Alt_Alt to ‘?_?’ if, keep the direction and enter 0.85 as the threshold value.
This tab should now look like Figure 3-3.
- Click OK.
This tool replaces every call that fails at least one of the specified thresholds with the missing-value indicator (?_?), ensuring that the remaining data is of high quality. Variant columns that end up with all missing values are inactivated in the genotype spreadsheet.
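The thresholds entered in the steps above can be summarized in a short Python sketch. This is not the SVS implementation: it assumes that the default comparison direction is "less than" for read depth, genotype quality, and the Alt_Alt ratio, and that the bounds 0.3 and 0.7 themselves count as inside the range. All names are hypothetical.

```python
NO_CALL = "?_?"
MIN_DEPTH = 10   # Read Depths (DP) threshold
MIN_GQ = 15      # Genotype Qualities (GQ) threshold


def fails_alt_ratio(zygosity: str, alt_ratio: float) -> bool:
    """Zygosity-based Alt Read Ratio check using the tutorial's thresholds."""
    if zygosity == "Ref_Ref":
        return alt_ratio >= 0.15
    if zygosity == "Ref_Alt":
        # Assumption: exactly 0.3 and 0.7 are treated as inside the range.
        return alt_ratio < 0.3 or alt_ratio > 0.7
    if zygosity == "Alt_Alt":
        # Assumption: the unchanged default direction is "less than".
        return alt_ratio < 0.85
    return False  # call is already missing or not recognized


def apply_quality_filters(zygosity: str, depth: int, gq: int, alt_ratio: float) -> str:
    """Return the call unchanged, or NO_CALL if any metric fails its threshold."""
    if depth < MIN_DEPTH or gq < MIN_GQ or fails_alt_ratio(zygosity, alt_ratio):
        return NO_CALL
    return zygosity


print(apply_quality_filters("Ref_Alt", 25, 60, 0.48))  # well-supported het call
print(apply_quality_filters("Ref_Ref", 30, 50, 0.20))  # too many alternate reads
```

A single failing metric is enough to set a call to missing, which is why one well-covered sample can keep a variant column active even when the other samples become no-calls.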
NOTE: The Set Genotypes to No-Call… tool can be fairly memory intensive. If it uses too much RAM for your machine, you can run the tool on subsets of your genotypes. For example, you could go to Select > Activate by Chromosomes and only select one or several chromosomes to be active, run Set Genotypes to No-Call…, then repeat this process using other chromosomes until all of your data has been filtered.
The Set Genotypes to No-Call… tool also creates a column subset spreadsheet (1kG CEU Trio – Genotypes (G_T) – Genotypes Filtered to No-call – Column Subset) of those markers that remain active after filtering.
- Rename this subset spreadsheet to High Quality Variants by right-clicking on the spreadsheet name at the bottom of the spreadsheet and selecting Rename.
4. Data Sources
Two annotation tracks are available locally by default: RefSeq Genes 105 Interim v2, NCBI for human build GRCh_37_g1k and RefSeq Genes 109 Interim v2, NCBI for human build GRCh_38. Depending on your specific filtering and analysis needs, a number of the tracks listed below may be useful.
To download these tracks, choose Tools > Manage Data Sources from the Project Navigator. This brings up the Data Source Library with your local tracks listed. To bring up a list of available tracks, click Public Annotations.
Below you will find descriptions of many commonly used data sources. Later in this tutorial, you will be using the Annotate and Filter Variants tool, through which you will utilize several of these data sources.
NOTE: You can also stream the following data sources from the cloud rather than downloading local copies. However, depending on your internet connection, using cloud-based sources may significantly slow down the process.
- Reference Sequence GRCh37 g1k, 1000Genomes: This track contains the reference nucleotide sequence that will be used to classify variants in the Variant Classification section (Section 6) of this tutorial.
NOTE: In Section 6, you will not be prompted to select the Reference Sequence GRCh37 g1k, 1000Genomes track. Instead, it will automatically be used, whether you have a local copy of it or not.
- NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI: Used to remove common variants as defined by the NHLBI Exome Sequencing Project.
- gnomAD Exomes Variant Frequencies 2.1.1, BROAD: Used to remove common variants as defined by the Genome Aggregation Database (gnomAD) coalition headed by the Broad Institute.
NOTE: For whole genome data, the data source gnomAD Genomes Variant Frequencies 2.1.1, BROAD is also available for download.
- 1kG Phase3 – Variant Frequencies 5a with Genotype Counts, GHI: Used to remove common variants as defined by the 1000 Genomes Phase 3 project.
- dbNSFP Functional Predictions 3.0, GHI: This data source contains functional predictions from six common algorithms (SIFT, Polyphen2 HumVar (HVAR), MutationTaster, MutationAssessor, FATHMM, and FATHMM-MKL Coding). You can use this tool to remove variants that are predicted as benign by some or all of the algorithms. There is also a more complete version of this database available for download that can be used.
Under the Secure Annotations repository is a list of premium annotations available as add-ons to any SVS license. These sources are designed to be annotated directly from the cloud so no local download is required.
These sources include the following:
- OMIM: Genes, Phenotypes, and Variants
- CADD: Raw C-Scores, PHRED-scaled Scores, and estimated scores for novel indels.
- OncoMD: Clinical Trials, Drugs Targeting Mutation, Functional Validation of Variant, Gene Info, Studies with Variant, and Variant Summary information.
NOTE: To add any of these sources to your SVS license, contact firstname.lastname@example.org.
5. Annotate and Filter with Variant Sources
- Open High Quality Variants and go to DNA-Seq > Annotate and Filter Variants, then choose Add Track(s).
- Select the data sources shown in Figure 5-1 from your Local directory. Click Select.
NOTE: You can optionally drag each data source to create a different filtering order, but the results will be the same regardless of order, as filtering is done at the end based on all sources.
The dialog should look like Figure 5-1.
- Click Next>.
- For the dbNSFP data source, under Optional Filters: click the plus icon to add a filter.
- From the Filter on: drop down select the N of 6 Predicted Damaging option, then select the 3 of 6 through 6 of 6 categories. This filter is based on the voting algorithm provided for the dbNSFP Source. The voting algorithm counts the number of prediction algorithms that predict the variant as damaging. The six algorithms used are SIFT, PolyPhen2 HumVar, MutationTaster, MutationAssessor, FATHMM and FATHMM-MKL Coding. The dialog should look like Figure 5-2. See dbNSFP Results for further details.
- Click Next>.
- For the gnomAD data source, under Optional Filters: click the plus icon to add a filter.
- From the Filter on: drop down select the Alt Allele Freq (AF) field and set the upper bound to be 0.01. The dialog should look like Figure 5-3.
- Click Next>.
- For the 1kG data source, under Optional Filters: click the plus icon to add a filter.
- From the Filter on: drop down select the Allele Frequencies field and set the upper bound to be 0.01. The dialog should look like Figure 5-4.
- Click Next>.
- For the NHLBI data source, under Optional Filters: click the plus icon to add a filter.
- From the Filter on: drop down select the All MAF field and set the upper bound to be 0.01. The dialog should look like Figure 5-5.
- Click Next>.
The last dialog will summarize all of the options selected. Click Finish.
When all annotation and filtering is finished the results dialog (Figure 5-6) will summarize the results of the workflow.
- The following spreadsheets are created:
- An annotation result spreadsheet for each of the selected sources.
- A spreadsheet with the title High Quality Variants – Applied Filters, which contains boolean fields for each filter with 1 if the variant passes the filter and 0 if it does not. The first column of this spreadsheet, Is Filtered?, indicates whether a variant has passed all the filters.
- Lastly, a spreadsheet that is a filtered down version of the original data set, called High Quality Variants – Filtered Subset, which contains only variants that have passed all the specified filters.
- Rename the final spreadsheet (High Quality Variants – Filtered Subset) to High Quality + Rare + Predicted Damaging.
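The logic behind the Applied Filters spreadsheet can be sketched as follows: each filter produces a 1 or 0 per variant, and only variants with a 1 for every filter reach the filtered subset. The sketch makes two assumptions that the tutorial does not state: a variant absent from a frequency source is treated as rare (it passes), and a variant without a dbNSFP vote count fails the prediction filter. All names are hypothetical.

```python
from typing import Dict, Optional

MAX_FREQ = 0.01          # upper bound used for gnomAD, 1kG and NHLBI
MIN_DAMAGING_VOTES = 3   # "3 of 6" through "6 of 6"


def applied_filters(
    damaging_votes: Optional[int],
    frequencies: Dict[str, Optional[float]],
) -> Dict[str, int]:
    """Return one 1/0 flag per filter for a single variant."""
    # Assumption: a missing vote count fails the dbNSFP filter.
    flags = {
        "dbNSFP": int(damaging_votes is not None and damaging_votes >= MIN_DAMAGING_VOTES)
    }
    for source, freq in frequencies.items():
        # Assumption: a variant not found in the source counts as rare.
        flags[source] = int(freq is None or freq <= MAX_FREQ)
    return flags


def passes_all(flags: Dict[str, int]) -> bool:
    """True only when every filter flag is 1."""
    return all(value == 1 for value in flags.values())


# Hypothetical rare variant predicted damaging by 4 of 6 algorithms.
rare = applied_filters(4, {"gnomAD": 0.002, "1kG": None, "NHLBI": 0.0005})
print(rare, passes_all(rare))
```

Because the final decision is an AND over all flags, the order in which the sources are listed does not change the result.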
6. Annotate Variant Effect on Transcripts
Annotate Variant Effect on Transcripts is available as a part of the Annotate and Filter Variants tool.
NOTE: This tool will automatically use the Reference Sequence GRCh37 g1k, 1000Genomes track, whether you have a local copy of it or not. If you have not already done so, you should download this track to your local directory.
- Open the High Quality + Rare + Predicted Damaging output and go to DNA-Seq > Annotate and Filter Variants, then choose Add Track(s).
- Select the RefSeq Genes 105 Interim v1, NCBI gene annotation source.
- Make sure the Annotate Variant Effect on Transcripts option is selected at the bottom of the dialog. See Figure 6-1.
- Click Next >.
The options dialog allows for two annotation report outputs that may be selected. The first is a Variant Report that includes a summary of the computed interactions between each variant and the overlapping transcripts. In the case of multiple interactions, the interaction with the highest priority will be listed.
The second report output is a Variant Interaction Report that can include auxiliary transform fields. This output displays the computed interactions between each variant and the overlapping transcripts at that location. Additionally, certain useful statistics and HGVS nomenclature are calculated for each variant-transcript pair.
- Make sure Variant Report and Variant Interactions Report are checked along with the Include Auxillary Transform Fields option.
Optional filtering is also available on this dialog. We will set up a filter using the Effect (Combined) field provided in the Variant Report output. For this filter we will be choosing to keep any variant considered Loss of Function or Missense, which allows us to remove any variant likely to have low or unknown effect on the transcript’s functional product.
- Under Optional Filters: click the plus icon to add a filter.
- From the Filter on: drop down select the Effect (Combined) field and set the LoF and Missense options. The dialog should look similar to Figure 6-2.
- Click Next > and then Finish on the last dialog.
Three output spreadsheets are created as well as a results dialog (Figure 6-3).
A new genotype spreadsheet is also created called High Quality + Rare + Predicted Damaging – Filtered Subset.
- Rename this spreadsheet to Candidate Variants by right-clicking on the title of the spreadsheet and selecting Rename.
The subset spreadsheet contains variants that were classified as Missense or LoF variants as reported in the Effect (Combined) column of the Variant Report. This subset of variants can have the following ontologies. (See Annotate Variant Effect on Transcripts for further details.)
Missense: The variant will cause at least one amino acid to change or cause a premature start codon in the UTR5. The ontologies included in this category are: disruptive_inframe_deletion, disruptive_inframe_insertion, inframe_deletion, inframe_insertion, 5_prime_UTR_premature_start_codon_gain_variant, missense_variant.
LoF: The variant is likely to cause the transcript’s product to lose function. The ontologies included in this category are: transcript_ablation, exon_loss_variant, stop_lost, stop_gained, initiator_codon_variant, frameshift_variant, splice_acceptor_variant, splice_donor_variant.
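The grouping of ontologies into the two categories above can be written as a small lookup, shown here as a Python sketch. The ontology lists are taken from this section; the "Other" label and the priority order (LoF before Missense before Other) used to pick one effect per variant are assumptions made for illustration, not the documented SVS behavior.

```python
from typing import Iterable

MISSENSE = frozenset({
    "disruptive_inframe_deletion", "disruptive_inframe_insertion",
    "inframe_deletion", "inframe_insertion",
    "5_prime_UTR_premature_start_codon_gain_variant", "missense_variant",
})

LOF = frozenset({
    "transcript_ablation", "exon_loss_variant", "stop_lost", "stop_gained",
    "initiator_codon_variant", "frameshift_variant",
    "splice_acceptor_variant", "splice_donor_variant",
})

# Assumed priority order, most severe first. "Other" is a hypothetical label.
PRIORITY = ("LoF", "Missense", "Other")


def combined_effect(ontology: str) -> str:
    """Map a single sequence ontology term to its combined effect category."""
    if ontology in LOF:
        return "LoF"
    if ontology in MISSENSE:
        return "Missense"
    return "Other"


def most_severe(ontologies: Iterable[str]) -> str:
    """Pick the highest-priority category across a variant's transcript interactions."""
    effects = [combined_effect(term) for term in ontologies]
    return min(effects, key=PRIORITY.index, default="Other")


# Hypothetical variant overlapping two transcripts.
print(most_severe(["missense_variant", "frameshift_variant"]))
```

A filter that keeps LoF and Missense then amounts to keeping every variant whose selected category is not "Other".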
7. Combining Results
Depending on your analysis, you may wish to look at only a subset of the classifications. For example, you may be interested only in variants classified as Non-synonymous and want to compare them with the annotation report created by the dbNSFP filtering tool.
First create a collated spreadsheet of the genotype, read depth, genotype quality, and alt read ratio information so it is all available in the same spreadsheet for comparison.
- From the Project Navigator go to Tools > Build Sample Collated Spreadsheet.
- Click Add Spreadsheet(s) and highlight the Candidate Variants, 1kG CEU Trio – Read Depths (DP) – Sheet 1, 1kG CEU Trio – Genotype Qualities (GQ) – Sheet 1, and Alt Read Ratio spreadsheets, then click OK.
- The dialog should look like Figure 7-1. Click Next>. There may be a pause of several seconds before the next dialog comes up.
The next dialog will ask you to choose the column headers for the selected data. Change the Suffix under Candidate Variants to GT and the Suffix under Alt Read Ratio to AD so the dialog looks like Figure 7-2. Then click OK.
Once complete you should have a spreadsheet with your markers listed in the row labels and your samples listed in groups of 4 columns per sample that contain your GT, DP, GQ, and AD information.
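The layout of the collated spreadsheet can be pictured with the following Python sketch, which interleaves the four data types per sample. The column naming ("sample suffix"), the marker label, and the sample names are hypothetical; only the GT, DP, GQ, AD grouping comes from the steps above.

```python
from typing import Dict, List

# marker label -> sample name -> value
Sheet = Dict[str, Dict[str, str]]


def collate(sheets: Dict[str, Sheet], markers: List[str], samples: List[str]) -> Dict[str, Dict[str, str]]:
    """Build one row per marker with one column per (sample, suffix) pair."""
    collated: Dict[str, Dict[str, str]] = {}
    for marker in markers:
        row: Dict[str, str] = {}
        for sample in samples:
            # Suffixes follow the order in which the sheets were supplied.
            for suffix, sheet in sheets.items():
                row[f"{sample} {suffix}"] = sheet.get(marker, {}).get(sample, "?")
        collated[marker] = row
    return collated


# Hypothetical data for one marker and two samples.
sheets = {
    "GT": {"1:1000": {"child": "A_G", "father": "A_A"}},
    "DP": {"1:1000": {"child": "25", "father": "31"}},
    "GQ": {"1:1000": {"child": "60", "father": "99"}},
    "AD": {"1:1000": {"child": "0.48", "father": "0.0"}},
}
table = collate(sheets, ["1:1000"], ["child", "father"])
print(list(table["1:1000"]))
```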
Now we will join up the collated data with the coding classification output as well as the dbNSFP annotation report so the data will be available for comparison in the same spreadsheet.
- From Sample Collated Spreadsheet – Sheet 1 go to File > Join or Merge Spreadsheets, select the dbNSFP Functional Predictions 3.0, GHI – Voting Report output, then click OK. Leave the default options in the Join dialog and click OK.
- Then from the joined Sample Collated Spreadsheet + dbNSFP Functional Predictions 3.0, GHI – Voting Report – Sheet 1 again select File > Join or Merge Spreadsheets. Highlight the RefSeq Genes 105 Interim v1, NCBI – Variant Report output and click OK.
- Enter Collated Variants + dbNSFP + Variant Effect under the New dataset name: option. Leave the other options at their defaults and click OK.
NOTE: If you wish to join several spreadsheets automatically, we have an add-on script available for this purpose: Join or Merge Several Spreadsheets.
The final spreadsheet should have your sample level information followed by dbNSFP annotations and lastly your transcript annotations for each variant in your candidate variant list.
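Conceptually, the two joins match rows by marker and append the annotation columns to the right of the sample-level columns. The sketch below assumes an inner join (only markers present in both spreadsheets are kept); the default options of the SVS Join dialog are not described here and may differ. Function, marker, and column names are hypothetical.

```python
from typing import Dict

Row = Dict[str, str]


def join_on_marker(left: Dict[str, Row], right: Dict[str, Row]) -> Dict[str, Row]:
    """Inner join on marker label; right-hand columns follow left-hand columns."""
    joined: Dict[str, Row] = {}
    for marker, left_row in left.items():
        if marker in right:
            # If a column name exists on both sides, the right-hand value wins.
            joined[marker] = {**left_row, **right[marker]}
    return joined


# Hypothetical rows: collated sample data, dbNSFP votes, transcript effects.
collated = {"1:1000": {"child GT": "A_G"}, "2:5000": {"child GT": "C_C"}}
dbnsfp = {"1:1000": {"Votes": "4 of 6"}, "2:5000": {"Votes": "3 of 6"}}
effects = {"1:1000": {"Effect (Combined)": "Missense"}}

final = join_on_marker(join_on_marker(collated, dbnsfp), effects)
print(final)
```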
The data can then be exported from SVS by going to File > Save As… > Text or Third Party Format and selecting your output choices.