Dr. Ken Kaufman’s extremely popular webinar inspired us to build new tools that would simplify the process of analyzing whole-exome DNA sequencing data even further. First I’ll describe the tools showcased in the webcast. Then I’ll detail the new tools we created to allow for a revised and simplified workflow.
Subset Informative Genotypes by Category
The Subset Informative Genotypes by Category tool scans genotypic columns to find informative genotypes, defined as having at least one non-missing, non-reference allele in any sample. Informative genotype column sets are found for each unique category in a user-defined categorical column. Dr. Kaufman uses this tool at 20:20 in the video.
This script simplifies the DNA-Seq workflow by focusing your filtering on the variants localized to the samples of interest. Categories may include (but are not limited to) family, population group, or even individual sample.
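To make the definition of an informative genotype concrete, here is a minimal Python sketch of the idea. It is not the SVS script itself: the dictionary layout, the "A_G" genotype encoding, the "?" missing-allele symbol, and the function names are all assumptions chosen for illustration.

```python
MISSING = "?"  # assumed symbol for a missing allele


def is_informative(genotype, ref_allele):
    """True if the genotype has at least one non-missing, non-reference allele."""
    return any(a != MISSING and a != ref_allele for a in genotype.split("_"))


def informative_columns_by_category(genotypes, categories, ref_alleles):
    """Return {category: sorted list of informative variant columns}.

    genotypes:   {sample: {variant: "A_G"}}   (hypothetical layout)
    categories:  {sample: category label}, e.g. a family ID
    ref_alleles: {variant: reference allele}
    """
    # Start every category with an empty set so categories without
    # informative variants still appear in the result.
    result = {category: set() for category in categories.values()}
    for sample, calls in genotypes.items():
        category = categories[sample]
        for variant, genotype in calls.items():
            if is_informative(genotype, ref_alleles[variant]):
                result[category].add(variant)
    return {category: sorted(variants) for category, variants in result.items()}
```

A variant column is kept for a category as soon as any one sample in that category carries a non-reference allele, which is why a per-sample category reduces each subset to the variants that individual actually carries.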
Build Sample Collated Spreadsheet
The Build Sample Collated Spreadsheet tool transposes several spreadsheets and collates them (joins them and sorts the columns into sample groups). The collated spreadsheet contains a row for each intersecting column in the original spreadsheets and several columns for each original row over all spreadsheets. Dr. Kaufman uses this tool at 25:03 in the video.
An obvious use case for this script is collating several spreadsheets that were imported by the VCF Importer. Having all of the data in one spreadsheet facilitates filtering based on values reported in VCF files such as Read Depth and Quality Score. The filtered variant set can then be transposed back into a genotypic spreadsheet, or the filter can be applied back to the original genotypic spreadsheet.
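The transpose-and-collate step can be sketched in plain Python as follows. This is only an illustration of the resulting layout, not the SVS implementation: the nested-dictionary input, the "sample - field" column naming, and the function name are assumptions.

```python
def build_sample_collated(sheets):
    """Transpose and collate several sample-by-variant tables.

    sheets: {field name: {sample: {variant: value}}}   (hypothetical layout;
            must contain at least one table)
    Returns (header, rows), where rows maps each variant shared by all
    tables to a list of values ordered like the header.
    """
    # Collect the column (variant) names present in each table.
    column_sets = []
    for table in sheets.values():
        columns = set()
        for row in table.values():
            columns.update(row)
        column_sets.append(columns)

    # Only columns found in every table become rows of the result.
    shared = sorted(set.intersection(*column_sets))
    samples = sorted({sample for table in sheets.values() for sample in table})

    # Group the output columns by sample: s1 - GT, s1 - DP, s2 - GT, ...
    header = [f"{sample} - {field}" for sample in samples for field in sheets]
    rows = {
        variant: [
            sheets[field].get(sample, {}).get(variant)
            for sample in samples
            for field in sheets
        ]
        for variant in shared
    }
    return header, rows
```

With one table of genotypes and one of read depths, each variant row ends up holding the genotype and read depth for every sample side by side, which is what makes the later threshold filtering straightforward.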
NEW: Filter on Multiple Columns
In Dr. Kaufman’s webcast, he applied different filtering criteria to several different columns in the Sample Collated Spreadsheet to filter out variants (between 26:25 and 28:00 in the recording). Since this required a separate filtering step for each column, a new tool was created to apply all filters in one step, saving the user the time and tedium of having to do each one separately.
In the first prompt, the user selects the columns to filter on. In the second prompt, the user defines the filtering criteria for each selected column. Categorical, Genotypic, and Binary columns have an “Activate by Category”-type filter, while Real and Integer columns have a threshold filter. A row must meet all filtering criteria to remain active.
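The "all criteria must pass" logic can be sketched like this. It is a simplified stand-in for the tool, assuming rows are plain dictionaries, a category filter is a set of allowed values, and a threshold filter is an (operator, value) pair; none of these names come from SVS.

```python
import operator

# Comparison operators available to threshold filters (assumed set).
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def row_passes(row, criteria):
    """True only if the row satisfies every per-column rule."""
    for column, rule in criteria.items():
        value = row[column]
        if isinstance(rule, (set, frozenset)):
            # Category-style filter: the value must be an activated category.
            if value not in rule:
                return False
        else:
            # Threshold filter for Real and Integer columns.
            op, threshold = rule
            if value is None or not OPS[op](value, threshold):
                return False
    return True


def filter_rows(rows, criteria):
    """Return the names of the rows that remain active."""
    return [name for name, row in rows.items() if row_passes(row, criteria)]
```

For example, criteria of {"DP": (">=", 10), "GQ": (">=", 20.0), "GT": {"A_G"}} keep only rows with adequate read depth and quality whose genotype is A_G, all in a single pass.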
This tool is slated to be shipped in SVS version 7.6.10, which will be released on September 6th, 2012.
NEW: Find de Novo Candidate Variants
This new tool combines code from Mendelian Error Check and Subset Informative Genotypes so that the following workflow can be accomplished in one step:
- Subset Informative Genotypes by Family ID
- Mendelian Error Check for each unique family spreadsheet (created in the previous step)
- Filter Mendelian Error output to include only Heterozygous Errors (repeated for each family)
- Apply filter to original spreadsheet (for each family) and create subsets
You can download it from our website: Find de Novo Candidate Variants.
This tool also allows the user to include informative homozygous (alt_alt) errors and has an option to include only affected children. A reference allele field is required if informative homozygous errors are included. The tool drastically reduces the number of new spreadsheets created: it creates one new child node spreadsheet per (affected) child, containing only the variant columns that are candidate functional polymorphisms.
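The combined workflow can be approximated in a few lines of Python. This is a hedged sketch, not the downloadable script: it assumes complete trios, "A_G"-style genotypes with "?" for missing alleles, and a simple definition of a Mendelian error (the child's genotype cannot be formed from one allele of each parent). The affected-children option is not modeled, and all names are hypothetical.

```python
from itertools import product

MISSING = "?"  # assumed symbol for a missing allele


def alleles(genotype):
    return tuple(genotype.split("_"))


def is_mendelian_consistent(child, father, mother):
    """True if the child's genotype is one paternal plus one maternal allele."""
    target = sorted(alleles(child))
    return any(
        sorted(pair) == target
        for pair in product(alleles(father), alleles(mother))
    )


def de_novo_candidates(trios, genotypes, ref_alleles=None, include_hom_alt=False):
    """Return {child: [candidate variants]}.

    trios:     {child: (father, mother)}
    genotypes: {sample: {variant: "A_G"}}
    """
    if include_hom_alt and ref_alleles is None:
        raise ValueError("reference alleles are required for homozygous errors")
    candidates = {}
    for child, (father, mother) in trios.items():
        hits = []
        for variant, child_gt in genotypes[child].items():
            trio = (child_gt, genotypes[father][variant], genotypes[mother][variant])
            # Skip variants where any member of the trio has a missing call.
            if any(MISSING in alleles(gt) for gt in trio):
                continue
            if is_mendelian_consistent(*trio):
                continue
            first, second = alleles(child_gt)
            if first != second:
                hits.append(variant)  # heterozygous Mendelian error
            elif include_hom_alt and first != ref_alleles[variant]:
                hits.append(variant)  # informative homozygous (alt_alt) error
        candidates[child] = hits
    return candidates
```

Note that the reference alleles are only consulted on the homozygous path, which mirrors why the tool requires a reference allele field only when informative homozygous errors are included.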
This script will replace the actions performed by Dr. Kaufman between 19:55 and 24:05 in the webcast and will make the process easier and your project cleaner.
As always, new tools (i.e., scripts) are included with your purchase of SNP & Variation Suite and can be downloaded from our scripts repository at any time.