The SVS 8.8.3 release was created to incorporate some of the CNV, genome assembly control, and splice site capabilities that are present in VarSeq, as well as to clean up and streamline the GWAS workflows (for example, when using Mixed Linear Model algorithms) for a better user experience.
New Product Add-Ons for SVS
- Golden Helix SVS now includes in-silico splice site, functional prediction, and conservation methods, available with the Clinical Variant Scoring add-on. This includes four splice site algorithms, SIFT/PolyPhen2 predictions that run on 100-way multiple species alignments, and GERP++ and PhyloP methods to calculate conservation scores.
- CNV calling designed for low and ultra-low read depth Whole Genome data is now available as part of the CNV add-on in SVS!
Designed for calling large cytogenetic events, this algorithm is able to detect chromosomal aneuploidy events with high confidence in ultra-low read depth genomes (as few as one million aligned reads, or 0.02X coverage).
With higher read-depth whole genomes, the CNV caller can be adjusted to detect up to 10 events per sample, with events as small as 100 Kb.
- Linear or logistic regression may now be performed directly on genotypic column data using the new spreadsheet Genotype menu item, Genotypic Regression Analysis. This feature first dynamically converts genotypic column data to numeric values according to your choice of genetic model and how to handle missing genotypic data, then performs the regression analysis on this numeric data. All of the features of Numeric > Numeric Regression Analysis (see Numeric Regression Analysis) are available for your selection in Genotypic Regression Analysis, which can additionally output your choice of genotype-related statistics.
- On import of VCF files, you can now lift over to the assembly of the project by specifying the “Assembly of Input Files” for your selected VCFs. Built-in liftover chain files support going from GRCh37 (hg19) to GRCh38 (hg38) and vice versa. Other chain files can be acquired from UCSC for other assemblies.
- A number of improvements have been made to Numeric > Numeric Regression Analysis. These improvements have also been incorporated into the new feature Genotype > Genotypic Regression Analysis.
- The covariate and interaction term windows now show implied covariates and interaction terms in a grayed-out font. These include, for instance, “(The current genotypic column)” or “(The current dynamic window)”. They also include covariate-column interaction terms by themselves as reduced-model covariates.
All the terms that you enter yourself are still displayed in a normal font. This makes the entire regression model clearer, especially for covariate-column interaction regressions.
- Better checking for duplicate covariates, both in the same and different covariate windows, is performed.
- When you add individual covariates, the selection window will close after you click “add”. (The window for interaction terms, on the other hand, will stay open after you click “add” to facilitate adding further interactions.)
- Collinearity checking for the selected covariates and interaction terms is now more thorough.
- The headers of the covariate and interaction term Beta value and Beta Standard Error value spreadsheet outputs for single-column regressions and covariate-column interaction regressions now clearly identify the covariate or interaction term involved in each output column.
- The X-axis of meta-analysis forest plots has been cleaned up and now uses a reasonable range for each set of data being plotted. Additionally, documentation for forest plots has been improved.
- For Single-Locus Mixed Models (EMMAX) and Single-Locus Mixed Models with Interactions, all Beta values and their standard errors are now output, along with the name of the covariate associated with each Beta or standard error output.
- The scaling factor used to scale the kinship matrix for Single-Locus Mixed Models (EMMAX), Single-Locus Mixed Models with Interactions, and Multi-Locus Mixed Models (MLMM) is now displayed for these features and explained better in the documentation.
- For most mixed-model features, including GBLUP, Single-Locus Mixed Model (EMMAX), Multi-Locus Mixed Model (MLMM), and Single-Locus Mixed Model with Interactions, if you specify a covariate that is collinear with one or more other covariates you have specified, you will now get a detailed diagnostic to help you remove that collinearity from your data.
- The following improvements have been made to the GBLUP (and Genetic Correlation) Average Information REML algorithm:
- The precision of the calculations has been tightened from 0.0001 to 0.000001.
- An optimization has been made that doubles the calculation speed for most use cases and makes the Genetic Correlation feature five times faster.
- The progress dialog for this algorithm shows more information about what is happening and may more easily be used to cancel the computation.
- Different Genome Assemblies have been added to this release and can be found by going to Tools > Manage Genome Assemblies.
- Cow, Bos taurus (ARS-UCD1.2)
- Goat, Capra hircus (ARS1)
- Cotton, Gossypium hirsutum (UTX-JGI v1.1)
- Channel catfish, Ictalurus punctatus (IpCoco 1.2)
- Turkey, Meleagris gallopavo (melGal5)
- Pig, Sus scrofa (Sscrofa 11.1)
- Due to the confusion it caused, we deprecated the reference genome assembly that contained the hg18 mitochondrial sequence, and all projects are now auto-updated to use the new “GRCh37(hg19)” reference. Older annotations are also now hidden by default to prevent accidental usage.
- The HGVS notation produced by our transcript annotation algorithm shortens very long insertions/deletions using abbreviated notation when changing more than five amino acids. Also, variants affecting the start codon and the stop codon have had their HGVS notation improved to be more informative.
- The copy number calling algorithms were updated to use a faster and near-equivalent segmentation algorithm strategy, improving run-times for exomes and gene panels with many targets in a small genomic region. In testing, this improvement causes no loss in sensitivity of true positive events. Adding a new CNV algorithm to a project will use the new strategy by default unless the option to use the slower optimal segmentation algorithm is selected in the advanced parameters tab. Existing templates will run with the previous strategy until they are manually updated.
- Integrated downloading of Affymetrix resources such as marker maps was removed as the dependent APIs were deprecated.
- SVS now does a better job of scaling the user interface when running on Windows computers with high DPI settings (high resolution monitors with scaling factors applied).
- When running Single-Locus Mixed Models (EMMAX), Single-Locus Mixed Models with Interactions, or Multi-Locus Mixed Models (MLMM), “Future warnings” would sometimes appear in the Python shell. These have been removed.
- If, in GBLUP, a genotypic covariate column or columns followed at least one inactive column, the wrong data would be used for the covariate values. This has been fixed.
- In the regression feature of Genotype Association Testing, the proper formula is now used for the intercept standard error.
- The wrong covariances were being used to compute the standard error of heritability for Single-Locus Mixed Models (EMMAX), Single-Locus Mixed Models with Interactions, and the EMMA algorithm for GBLUP. Estimates of the variance and covariance of the Vg and Ve estimates are now obtained from the inverse of the Average Information matrix (as computed according to the Average Information algorithm).
- The Likelihood Ratio Test output from the EMMA algorithm for GBLUP has been fixed to use the restricted maximum-likelihood (REML) value for the reduced model (where Vg = 0). (The maximum-likelihood (ML) value, calculated improperly for odd numbers of samples, had been used before.)
- If the standard error values and confidence intervals cannot be computed in a logistic regression, missing-value indicators (“?”) are now output, rather than zeros or other nonsense values.
- For Single-Locus Mixed Models with Interactions, if, for a given marker, all interaction terms must be removed, missing-value indicators for the p-value-related outputs are now filled in, rather than nonsense values. Since the reduced model is still preserved in this circumstance as the one and only model, the Betas and Beta standard errors for the reduced model are still output.
- In Single-Locus Mixed Models (EMMAX), Single-Locus Mixed Models with Interactions, and Multi-Locus Mixed Models (MLMM), the number of genotype columns that are multi-allelic is now counted correctly when the genetic model being used is dominant or recessive. Before, these were lumped in with “Markers with no data.”
- For GBLUP (using the EMMA algorithm), Single-Locus Mixed Model (EMMAX), Multi-Locus Mixed Model (MLMM), and Single-Locus Mixed Model with Interactions, the computation of the variance component ratio (“delta”) has been made more robust to round-off error when this error causes some of the non-zero eigenvalues of the restricted-model matrix S(K+I)S to be less than one.
While this change will now yield correct answers much more of the time, there still could be situations where it will not. Therefore, a warning message is now output if this specific situation happens.
Before, if this situation occurred, GBLUP would often crash, and the other MLM features would often give incorrect answers.
- GBLUP would crash if the dependent variable was all zeroes. This has been fixed.
- “Remove Selected” from the covariate selection window of Bayesian Genomic Prediction no longer removes all covariates (selected or not).
- The Select > Activate by Gene List feature was lower-casing instead of upper-casing gene names, so matching genes were not found. This has been fixed.
- The Large-Data (EMMA algorithm) GBLUP feature is now restored. It had been crashing immediately after it had finished computing or reading in the GRM kinship matrix.
- If the scripting interface function genotypeColumnData for the Tabular module is used in Ref/Alt mode with the multi AltOK flag set to True, and there is no reference allele in a given genotypic column, all missing-value indicators will now be returned rather than unreliable data.
- The VCF import feature now better handles merging genotypes for the same variants that come from different files where some of those files contain the genotype in multi-allelic form. Other improvements in multi-sample import were also made to handle more edge cases.
- The Single-Locus Mixed Model (EMMAX), Multi-Locus Mixed Model (MLMM), and Single-Locus Mixed Model with Interactions features now monitor whether there are too few samples present to perform computations or, in the case of MLMM, to continue stepwise forward selection.
- The Multi-Locus Mixed Model (MLMM) feature will now discontinue stepwise forward selection if the variance is fully explained by covariates. Estimates of the variance and covariance of the Vg and Ve estimates are now provided for every step of the Multi-Locus Mixed Model (MLMM) feature.
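
Several of the notes above refer to the variance and covariance of the Vg and Ve estimates and to the standard error of heritability derived from them. These notes do not state the exact formula SVS uses, so the following is only a minimal sketch of one common approach, the first-order delta method. It assumes heritability is defined as Vg / (Vg + Ve). The function name `heritability_with_se` and the input numbers are hypothetical and are not part of the SVS scripting interface.

```python
import math


def heritability_with_se(vg, ve, cov):
    """Return (h2, se) for h2 = vg / (vg + ve) using the delta method.

    cov is the 2x2 covariance matrix of the (Vg, Ve) estimates,
    [[var(Vg), cov(Vg, Ve)], [cov(Vg, Ve), var(Ve)]], for example the
    inverse of an Average Information matrix. The definition of h2 and
    the delta-method approximation are assumptions of this sketch.
    """
    total = vg + ve
    if total <= 0:
        raise ValueError("Vg + Ve must be positive")

    h2 = vg / total

    # Partial derivatives of h2 with respect to Vg and Ve.
    d_vg = ve / total ** 2
    d_ve = -vg / total ** 2

    # First-order variance approximation: gradient' * cov * gradient.
    var_h2 = (
        d_vg ** 2 * cov[0][0]
        + 2.0 * d_vg * d_ve * cov[0][1]
        + d_ve ** 2 * cov[1][1]
    )
    # Guard against tiny negative values caused by round-off.
    return h2, math.sqrt(max(var_h2, 0.0))


# Hypothetical variance component estimates and covariance matrix.
h2, se = heritability_with_se(2.0, 2.0, [[0.04, -0.01], [-0.01, 0.04]])
print(f"h2 = {h2:.3f}, SE = {se:.4f}")
```

The off-diagonal covariance term matters here: using the wrong covariances changes the standard error even when the Vg and Ve estimates themselves are unchanged.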