Welcome to the NGS Family Analysis Tutorial!
Updated: January 28, 2021
Version: 8.6.0 or higher
This tutorial covers an advanced workflow that performs a combination of functional and sequence quality control measurements to filter the DNA sequence data to obtain a small number of candidate de Novo, rare-recessively inherited, and compound heterozygous mutations.
To complete this tutorial you will need to download and unzip the following file, which includes a starter project for this tutorial.
Download Necessary Tracks
Before you begin this tutorial, you will need to download a few different annotation sources used in the tutorial.
NOTE: For this tutorial, you will also need the annotation track RefSeq Genes 105 Interim v1, NCBI. However, this track is available locally by default, so you should not need to download it.
Download the annotation tracks below through the SVS Data Source Library. You can access this dialog by selecting Tools > Manage Data Sources from the SVS Project Navigator.
Download Annotation Sources
The NHLBI ESP6500 Exomes Variant Frequencies annotation source contains minor allele frequencies from the NHLBI Exome Sequencing project.
The dbNSFP is an integrated database of functional annotations from multiple sources for the comprehensive collection of human non-synonymous SNPs (NSs). Its current version includes a total of 82,832,027 non-synonymous SNV and splice site SNVs. It compiles prediction scores from 14 prediction algorithms (SIFT, Polyphen2, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, PROVEAN, FATHMM-MKL coding and fitCons), eight conservation scores (phyloP46way_primate, phyloP46way_placental, phyloP100way_vertebrate, phastCons46way_primate, phastCons46way_placental, phastCons100way_veterbrate, GERP++ and SiPhy) and other function annotations.
You will need to download these source via the Data Source Library. These are large files and the downloads may take a while depending on your internet connection.
- From SVS, choose Tools > Manage Data Sources.
- Click Public Annotations to bring up a list of tracks available on our data server.
- Navigate to NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI and check this annotation track. (shown in Figure D-1).
- Navigate to dbNSFP Functional Predictions 3.0, GHI and check this annotation track.
- Additionally, to complete Variant Transcript Annotation in Part 3 of the tutorial, you will need a local copy of the Reference Sequence source. Navigate to Reference Sequence GRCH37 g1k, 1000Genomes and check this annotation track. NOTE: In Part 3, you will NOT be prompted to select this track. Instead, if you do not have a local copy, the source information will automatically be read over the network, which will be much slower.
- Click Download.
After the downloads have finished, the tracks are automatically placed in the correct directory, and you may proceed with the tutorial itself.
The advent of high-throughput sequencing (often called Next Generation Sequencing) allows for affordable assaying of not just common and pre-defined lists of genetic markers, but every variant covered by the sequencing target. See A Hitchhiker’s Guide to Next Generation Sequencing – Part 3 for a discussion of the motivation for this technology in different experiment types and an overview of the upstream bioinformatic steps to get to a variant list.
For whole genome sequencing, this can easily result in 3-4 million variants for an individual human. For exome sequencing, one hundred thousand variants are commonly identified in and around gene regions. But because many variants are private to individual samples, when analyzing a group of samples you may have much larger total variant numbers.
In essence, when using microarrays to genotype samples, a filtering process was already used to select the common variants that went into the probe design of a microarray platform. Now that we are starting with the full set of rare and common variants, you get to choose the appropriate filters and subset of variants that are appropriate for your downstream analysis.
This tutorial covers three different but overlapping analysis workflows aimed at narrowing down the variant set to different types of polymorphisms; de Novo Mutations, Compound Heterozygous regions and Rare Recessively inherited mutations. The exome dataset used in the tutorial is a publicly available trio from 1000 Genomes.
2. Filter on Genotype Quality Metrics
- Open the project that was just downloaded.
In the project, you will find four spreadsheets containing variant calls, genotype quality, read depth and allelic depth information. Each spreadsheet has three rows corresponding to the three samples in the trio and about 2.3 million columns. Since this is exome data, we would expect that many of these variants are not actually in exomes but rather have very low quality due to multiple mappings in the genome. The first step of filtering on quality score, read depth, and allelic depth will presumably eliminate many of the columns.
First, the allelic depths spreadsheet must be processed so that the values are in ratio format.
- Open CEU Trio – Allelic Depths (AD) – Sheet 1 and choose DNA-Seq > Calculate Alt Read Ratio.
A new subset spreadsheet called Alt Read Ratio is created. Close all open spreadsheets.
- Open CEU Trio – Genotypes Pedigree Spreadsheet.
- Select DNA-Seq > Set Genotypes to No-Call based on Additional Spreadsheets.
- Select Add Spreadsheet(s).
- Highlight CEU Trio – Read Depths (DP) – Sheet 1, CEU Trio – Genotype Qualities (GQ) – Sheet 1 and Alt Read Ratio. Click OK.
- The dialog should look like Figure 2-1. Click Next>.
- In the first tab, for Read Depth (DP), leave <= in the drop down box and enter 10 as the threshold value. See Figure 2-2.
- In the second tab, for Genotype Quality (GQ), leave <= in the drop down box and enter 40 as the threshold value. See Figure 2-3.
- In the third tab, for Alt Read Ratio, check Zygosity Based Filtering Options.
- After Convert Ref_Ref to ‘?_?’ if, change the direction to >= and enter 0.15 as the threshold value.
- After Convert Ref_Alt to ‘?_?’ if, select Filter by Range and change inside of to outside of in the drop down menu and enter in 0.3 for the lower bound and 0.7 for the upper bound.
- After Convert Alt_Alt to ‘?_?’ if, keep the direction and enter 0.85 as the threshold value.
- This tab should now look like Figure 2-4. Click OK.
NOTE: The thresholds used in this filtering step were selected based on the empirical distributions and may not be appropriate for all datasets. If using this step on your own data, we recommend investigating the distributions of these metrics to determine appropriate thresholds.
As expected, the quality filter removed 2,824,356 calls, which resulted in 2,147,895 columns containing all-missing values. (These columns are now inactivated in the genotype spreadsheet.)
- Rename the subset spreadsheet to High Quality Variants.
3. Annotate Variant Effect on Transcripts
Annotate Variant Effect on Transcripts is available as a part of the Annotate and Filter Variants tool. We will use this tool to remove variants with low or unknown functional effect, as well as any variants outside of exon regions that may have passed the previous quality filters.
NOTE: This tool will automatically use the Reference Sequence GRCH37 g1k, 1000Genomes track, whether you have a local copy of it or not. If you have not already done so, you should download this track to your local directory.
- From the High Quality Variants spreadsheet go to DNA-Seq > Annotate and Filter Variants, then choose Add Track(s).
- Select the RefSeq Genes 105 Interim v1, NCBI gene annotation source, which should be in your local directory.
- Make sure the Annotate Variant Effect on Transcripts option is selected at the bottom of the dialog. See Figure 3-1.
- Click Next >.
The options dialog allows for two annotation report outputs that may be selected. The first is a Variant Report that includes a summary of the computed interactions between each variant and the overlapping transcripts. In the case of multiple interactions, the interaction with the highest priority will be listed.
The second report output is a Variant Interaction Report that can include auxillary transform fields. This output displays the computed interactions between each variant and the overlapping transcripts at that location. Additionally, certain useful statistics and HGVS nomenclature are calculated for each variant-transcript pair.
- Make sure Variant Report and Variant Interactions Report are checked along with the Include Auxillary Transform Fields option.
Optional filtering is also available on this dialog. We will set up a filter using the Effect (Combined) field provided in the Variant Report output. For this filter we will be choosing to keep any variant considered Loss of Function or Missense, which allows us to remove any variant likely to have low or unknown effect on the transcript’s functional product.
- Under Optional Filters: click the plus icon to add a filter.
- From the Filter on: drop down select the Effect (Combined) field and set the LoF and Missense options. The dialog should look like Figure 3-2.
- Click Next > and then Finish on the last dialog.
Three output spreadsheets are created, as well as a results message (Figure 3-3) and a subset spreadsheet, High Quality Variants – Filtered Subset.
The subset spreadsheet, High Quality Variants – Filtered Subset, contains variants that were classified as Missense or LoF variants as reported in the Effect (Combined) column of the Variant Report. This subset of variants can have the following ontologies. (See Annotate Variant Effect on Transcripts for further details.)
Missense: The variant will cause at least one amino acid to change or cause a premature start codon in the UTR5. The ontologies included in this category are: disruptive_inframe_deletion, disruptive_inframe_insertion, inframe_deletion, inframe_insertion, 5_prime_UTR_premature_start_codon_gain_variant, missense_variant.
LoF: The variant is likely to cause the transcript’s product to lose function. The ontologies included in this category are: transcript_ablation, exon_loss_variant, stop_lost, stop_gained, initiator_codon_variant, frameshift_variant, splice_acceptor_variant, splice_donor_variant.
Now, to clean up the project navigator, create a top-level spreadsheet.
- From High Quality Variants – Filtered Subset, choose File > Create Top-Level Spreadsheet.
- Enter the name High Quality – Functional Variants and click OK.
NOTE: At this point, you could also close and minimize all nodes except for the top-level node just created. The view of the project navigator would then look like Figure 3-4.
4. Find de Novo Mutations
The first analysis procedure demonstrated in this tutorial aims to find de Novo mutations in the affected child that could explain the phenotype. A variant that is a candidate will have a heterozygous call in the affected child and reference calls for the parents. Thus the variant would be classified as a Mendelian Error.
- From High Quality – Functional Variants – Sheet 1 choose DNA-Seq > Find de Novo Candidate Variants.
- Leave the default options and click OK. See Figure 4-1.
NOTE: In some cases, you may wish to include homozygous alternate calls in the affected child as candidates. This option is not selected by default as it would be extremely rare and thus considered a genotyping anomaly.
Two new spreadsheets are created. One (De Novo Candidate Variants) is a report that lists all of the candidates in the row labels with the total count of de Novo variants in the first column. (The count is always 1 for this example–the count would have been more informative if multiple trios had been in the spreadsheet.) The other new spreadsheet is a candidate genotype subset, High Quality – Functional Variants – NA12878 Candidates.
Next, validate the results by looking at the reads (BAM files) in GenomeBrowse.
- Open the candidate genotype spreadsheet, High Quality – Functional Variants – NA12878 Candidates, and choose GenomeBrowse > Variant Map.
By default, the plot item will be called Variants.
- Right-click on Variants in the plot tree and choose Edit Title…. Call this plot item de Novo Candidates.
- Hover the mouse cursor inside of the variant map plot area. You should see a gear icon and a spreadsheet icon appear in the top-left corner of this plot area (below de Novo Candidates and to the right of NA12891). Click on the spreadsheet icon to open a Feature list. See Figure 4-2.
- Click on the first variant in the feature list (1:1225707-SNV). The view will zoom to that variant.
- Now add the BAM files the correspond with these samples. Click on the Plot button in the upper left corner of the GenomeBrowse window. Then click the black triangle next to the Example Samples location to expand the folder options.
- Under the CEPH 1463 Trio – High Coverage Exomes folder find and check NA12878.mapped…, NA12891.mapped… and NA12892.mapped…. See Figure 4-3.
- Click Plot & Close.
NOTE: In the GenomeBrowse window, you can uncheck some of the plots, such as the Pile-up views, so that everything can be visible on one screen.
Note that the feature list now contains only the one entry for 1:1225707-SNV.
- On the upper-right corner of the feature list area, for Read:, select All.
- Near the lower-left corner of the feature list area, click read more.
This will show the entire feature list of 1,461 features. You can now use this list to scroll through all of the candidate variants to compare the variant calls to the reads in the BAM file. See Figure 4-4.
5. Compound Heterozygous Polymorphisms
A compound heterozygous polymorphism refers to a child that has inherited two different heterozygous polymorphisms within the same gene, one from each parent. This could result in both copies of the gene being potentially affected.
This type of polymorphism should also alter the amino acid sequence, or be classified as something other than synonymous. Therefore it is appropriate to start from the High Quality – Functional Variants variant set. One would also expect this type of polymorphism to be rare, so an additional minor allele frequency filter is applied.
- From High Quality – Functional Variants – Sheet 1, choose DNA-Seq > Annotate and Filter Variants.
- Click Add Track(s) and check NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI. Click Select.
- Make sure Annotate Variant Effect on Transcripts is checked and click Next>.
In this dialog, the user can choose to simply annotate the variants or to also apply a filter based on a numeric field in the annotation track. In this case, apply a filter based on the minor allele frequency calculated from the European American population. Then use the filtered variants to create a rare variant subset.
- Select the plus sign on the right of the Configured Filters: option, then choose European American MAF from the Filter on: dropdown menu. Enter 0.01 as the upper bound.
- The dialog should look like Figure 5-1. Click Next>, then on the summary page that appears, click Finish.
An annotation report, NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI – Annotation Results, is created that contains all variants that were found in both the spreadsheet and the annotation source. An applied filters output, High Quality – Functional Variants – Sheet 1 – Applied Filters, which contains filter information for every variant in the original spreadsheet, is also created.
Finally, a rare variant subset spreadsheet, High Quality – Functional Variants – Sheet 1 – Filtered Subset, is created.
- Open High Quality – Functional Variants – Sheet 1 – Filtered Subset.
- Rename this subset spreadsheet to Rare Variants.
Next, filter the variant set to only include columns that have a heterozygous call for the child. Do not use the parent genotypes in this filter, since at a given variant the mother will be heterozygous while the father will not, and vice-versa.
- From Rare Variants, choose DNA-Seq > Activate Variants by Sample Genotypes.
- Check Alt_Ref under the child’s label NA12878.
- The dialog should look like Figure 5-2. Click OK.
- Create a column subset of this spreadsheet (Rare Variants). Rename the subset spreadsheet to Rare Variants – Heterozygous in Child.
Next, use a tool that scans the spreadsheet for compound heterozygous events and generates a report of events per gene.
- From Rare Variants – Heterozygous in Child, choose DNA-Seq > Score Compound Heterozygous Regions.
- Check Score Compound Heterozygous Variants (Create additional per variant output).
- The dialog should look like Figure 5-3. Click OK.
Two reports are generated that contain counts of heterozygous genes per trio and the heterozygous mutations found within those genes.
- Open Score Compound Heterozygous Genes.
- Right-click on the first column header Total Compound Het and choose Value Counts.
The 1 category in the resulting output message will give you a count (in this example, 53) of the Compound Heterozygous Genes in the dataset. Next, visualize these variants in GenomeBrowse.
- Open Score Compound Heterozygous Variants and choose Select > Apply Current Selection to Second Spreadsheet.
- Change columns to rows and leave the selected spreadsheet.
- The dialog should look like Figure 5-4. Click OK.
- In spreadsheet Rare Variants – Heterozygous in Child – Sheet 2, reactivate the pedigree columns by clicking once on each column header.
- Create a column subset (of this spreadsheet) and rename the subset Compound Het Candidates.
- Open the GenomeBrowse window, Variant Map of High Quality – Functional Variants – NA12878 Candidates. This window was created in Section 4 of this tutorial (Find de Novo Mutations).
- Click Plot, then click Project.
- Highlight Compound Het Candidates and check Variants. See Figure 5-5.
- Click Plot & Close.
By default, the plot item will be called Variants.
- Change this to Compound Het Candidates (right-click the plot item and choose Edit Title…). Then drag the plot below the BAM files.
- Change the Feature List dropdown (near the top of the Feature List box) to Compound Het Candidates.
You may now go through the Compound Het Candidates list to check for data quality and for comparison with de Novo Candidates.
NOTE: If you wish, you can match Figure 5-6 precisely by clicking on row 41 of the (Compound Het Candidates) feature list, moving the the red dashed vertical line to almost the left side (by moving your cursor appropriately), then clicking “-” on the zoom bar until the next compound het candidate shows up toward the right side of the plot.
If the calls can be confirmed to be accurate and of high quality, the gene could be a good candidate for causing the affected phenotype.
6. Rare Recessive Homozygous Polymorphisms
This type of polymorphism refers to a trio in which the parents are both heterozygous for the same polymorphism and the child inherited both alternate alleles, making the child homozygous for the recessive allele.
With this type of polymorphism, one would expect it to be found within a rare functional variant set. Therefore, begin with the Rare Variants subset. The inheritance pattern is different from the one assumed in the previous step, so use the same tool to filter down to variants that follow the expected inheritance pattern.
- Open Rare Variants and choose Select > Activate All.
Rare Variants – Sheet 2 is created.
- Now choose DNA-Seq > Activate Variants by Sample Genotypes.
With this type of polymorphism, you want to filter down to variants that are Alt_Alt for the child and Alt_Ref for the parents.
- Under NA12878 check Alt_Alt. Under NA12891 and NA12892 check Alt_Ref.
- The dialog should look like Figure 6-1. Click OK.
Five variant columns (11 total columns) remain active, meaning that the genotypes in these columns follow the specified inheritance pattern.
- Create a column subset and rename it Rare Variants – Rare Recessive.
Open the previously created GenomeBrowse window (Variant Map of High Quality – Functional Variants – NA12878 Candidates) to investigate the variants further.
In your own study, if you are able to confirm the variant calls in GenomeBrowse, the variants could be candidates for further validation.
This tutorial demonstrated different workflows to narrow down a variant set to high quality functional variants that were further narrowed down to three different types of polymorphisms.
You could use the techniques demonstrated in this tutorial to discover de Novo candidate, compound heterozygous and rare recessive homozygous polymorphisms. Any of these polymorphism types could cause an undesirable phenotype.