Since Dr. Ken Kaufman gave his webcast on Identifying Candidate Functional Polymorphisms in SVS, we’ve been working with Dr. Kaufman to simplify and automate many of the steps in his workflow. I touched on this in my last blog post, and I’m excited to report that with Ken’s help, we’ve been able to simplify the workflow even more.
In particular, we’ve automated several of the steps so that they repeat over all trios. The benefit of this approach, in Ken’s own words:
“I did this workflow on 51 trios today. The first four steps take the longest (about an hour). The last three steps take about 15 minutes. Previously it would have taken me over a week to do all this. It takes about as much time to export the results individually as it does to generate them.”
We are currently working on making this a tutorial that will be featured in the DNA-Seq section of the SVS Tutorials page. Until that is completed, I want to provide our customers with a high-level outline of how to use these scripts in their workflow. I won’t go into too much detail here (the tutorial will be up soon!), but we invite you to contact our bioinformatics support team with any questions you may have.
Set Genotypes to No-Call based on Additional Spreadsheets
This tool is an expanded version of “Set Genotypes to No-Call based on Second Spreadsheet” that allows the user to filter based on several quality spreadsheets, such as read depth and quality score, in one pass.
Genotypes are set to no-call (?_?) when the corresponding numeric or categorical values in the additional spreadsheets fail the filtering requirements you specify. Each input spreadsheet must have at least one column header and one row label in common.
This tool should also be used for all advanced ref/alt filtering, regardless of the number of additional spreadsheets selected. “Set Genotypes to No-Call based on Second Spreadsheet” was deprecated in SVS version 7.6.11 and replaced by Quality Assurance > Genotype > Set Genotypes to No-Call based on Additional Spreadsheets and Quality Assurance > Set Values to Missing based on Second Spreadsheet. The latter can be used on genotypic data as well as on numeric and categorical data.
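To make the filtering rule concrete, here is a minimal Python sketch of the idea. It is not the SVS implementation: the dictionary layout, the function name `set_no_calls`, and the sample data are all hypothetical, and it assumes that a genotype becomes a no-call as soon as any one quality value falls below its minimum.

```python
NO_CALL = "?_?"

def set_no_calls(genotypes, quality_sheets, minimums):
    """Return a copy of genotypes with failing cells set to NO_CALL.

    genotypes:      {sample: {variant: "A_G"}}
    quality_sheets: {sheet name: {sample: {variant: number}}}
    minimums:       {sheet name: smallest acceptable value}
    """
    filtered = {}
    for sample, calls in genotypes.items():
        filtered[sample] = {}
        for variant, call in calls.items():
            failed = False
            for name, minimum in minimums.items():
                value = quality_sheets[name].get(sample, {}).get(variant)
                # Assumption: cells absent from a quality sheet are left alone.
                if value is not None and value < minimum:
                    failed = True
                    break
            filtered[sample][variant] = NO_CALL if failed else call
    return filtered

# Hypothetical data for two samples and two variants.
genotypes = {
    "child":  {"var1": "A_G", "var2": "C_C"},
    "mother": {"var1": "A_A", "var2": "C_T"},
}
depth = {
    "child":  {"var1": 30, "var2": 12},
    "mother": {"var1": 22, "var2": 40},
}
quality = {
    "child":  {"var1": 99, "var2": 50},
    "mother": {"var1": 18, "var2": 60},
}

result = set_no_calls(
    genotypes,
    {"DP": depth, "QS": quality},
    {"DP": 15, "QS": 20},
)
```

In this toy example the child's var2 call fails on depth and the mother's var1 call fails on quality score, so both become `?_?` in a single pass while the other two calls are kept.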
Calculate Alt Read Ratio
Given alternate read depth and reference read depth spreadsheets, this tool calculates the ratio of alternate reads over total reads for each sample-variant cell.
The heterozygous calls of interest in this workflow are expected to have ratios near 50%. Distributions that peak around other values (such as 25%) most likely correspond to a misalignment of a replicated sequence and can be filtered out using “Set Genotypes to No-Call based on Additional Spreadsheets” or “Set Values to Missing based on Second Spreadsheet”.
This tool is available on our website as an add-on script.
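The calculation itself is simple enough to sketch in a few lines of Python. This is only an illustration of the arithmetic, not the add-on script: the function name `alt_read_ratio`, the dictionary layout, and the read counts are hypothetical, and returning `None` for cells with no reads is my own assumption.

```python
def alt_read_ratio(alt_depth, ref_depth):
    """Compute alt reads / (alt reads + ref reads) for each sample-variant cell.

    Both inputs are {sample: {variant: read count}} with matching keys.
    Cells with zero total reads get None (assumption) instead of a ratio.
    """
    ratios = {}
    for sample, alt_row in alt_depth.items():
        ratios[sample] = {}
        for variant, alt in alt_row.items():
            total = alt + ref_depth[sample][variant]
            ratios[sample][variant] = alt / total if total > 0 else None
    return ratios

# Hypothetical read counts for one sample.
alt_depth = {"child": {"var1": 12, "var2": 4, "var3": 0}}
ref_depth = {"child": {"var1": 12, "var2": 16, "var3": 0}}

ratios = alt_read_ratio(alt_depth, ref_depth)

# Cells whose ratio falls below a chosen cutoff (0.25 here) are candidates
# for being set to missing in a follow-up filtering step.
low_ratio_cells = sorted(
    (sample, variant)
    for sample, row in ratios.items()
    for variant, ratio in row.items()
    if ratio is not None and ratio < 0.25
)
```

Here var1 has a ratio of 0.5, as expected for a clean heterozygous call, while var2 falls below the cutoff and would be flagged.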
Generalized Workflow for Identifying Candidate Functional Polymorphisms Using New Tools
- Import a VCF file with multiple trios [Import > Import VCF from One or More Files].
- Filter down to rare variants using the Exome Sequencing Project allele frequency track (NHLBI ESP6500) [Select > Filter by Annotation > Filter on Variant Frequency Catalog].
- Create a subset and rename it to Rare Variants.
- Run Variant Classification to generate a spreadsheet called Coding Variant Classification. You will use this to filter down to nonsynonymous variants within gene regions.
- Right-click on the first column header (Classification) and activate all categories except for Synonymous.
- Choose Select > Apply Filter to Spreadsheet.
- Create a column subset of Rare Variants and rename it to Rare Nonsynonymous.
- Filter on read depth (DP) and genotype quality scores (QS) [Quality Assurance > Genotype > Set Genotypes to No-Call based on Additional Spreadsheets].
- Filter on both DP and QS in one step. Filter out variants with DP < 15 and QS < 20.
- Using Advanced Ref/Alt Filtering in Set Genotypes to No-Call based on Additional Spreadsheets, set to no-call the heterozygous calls that have alt reads less than 10, homozygous alternate calls that have alt reads greater than 10, and homozygous reference calls that have alt reads less than 5.
- Calculate Alt Read Ratio [Quality Assurance > Calculate Alt Read Ratio].
- Filter out calls with Alt Read Ratio < 0.25 [Quality Assurance > Set Values to Missing based on Second Spreadsheet].
- Merge Phenotype information with filtered Genotypes.
- Run Quality Assurance > Find de Novo Candidate Variants.
- Run Quality Assurance > Score Compound Heterozygous Regions.
- Run Quality Assurance > Score Variants by Recessive Model.
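For readers new to trio analysis, the idea behind a de novo candidate can be sketched in Python as follows. This is a deliberately simplified Mendelian check and not how the SVS Find de Novo Candidate Variants tool is implemented: the function name `is_de_novo_candidate` and the `allele_allele` genotype strings are assumptions for illustration, and any trio containing a no-call is simply skipped.

```python
NO_CALL = "?_?"

def is_de_novo_candidate(child, mother, father):
    """Return True if the child carries an allele seen in neither parent.

    Genotypes are strings such as "A_G". Trios with any no-call are
    skipped (assumption), which is why the earlier quality filtering
    matters: low-quality parental calls never produce a candidate.
    """
    if NO_CALL in (child, mother, father):
        return False
    parental_alleles = set(mother.split("_")) | set(father.split("_"))
    return any(allele not in parental_alleles for allele in child.split("_"))

# Hypothetical trio genotypes at three variants.
novel = is_de_novo_candidate("A_G", "A_A", "A_A")      # G is absent in both parents
inherited = is_de_novo_candidate("A_G", "A_G", "A_A")  # G comes from the mother
skipped = is_de_novo_candidate("A_G", "?_?", "A_A")    # mother was set to no-call
```

Only the first trio is reported as a candidate; the second is ordinary inheritance and the third is excluded because one parent's genotype was filtered to no-call.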