- A new feature, Genotype > Compute Large Data PCA, has been implemented. This feature can find a few principal components in a dataset that may contain many thousands of samples, up to 10 times as many as could previously be processed, and in much less time. This feature works with either genotypic or numeric data.
- The PhoRank Gene Ranking feature has a new default mode called PhoRank Clinical. This algorithm uses semantic similarity for phenotypic prioritization and demonstrates a direct clinical relationship between phenotypes and genes. PhoRank Research, the previously available version of PhoRank, uses ontology propagation for phenotypic prioritization and is still available in SVS.
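These notes do not describe the algorithm behind Compute Large Data PCA. As a purely illustrative sketch of how a few leading components can be found without a full decomposition, the following uses power iteration with deflation, a generic technique and not necessarily what SVS does. The function name `leading_components` and its parameters are hypothetical.

```python
import math
import random


def leading_components(rows, k, iters=200, seed=0):
    """Return k leading principal axes of `rows` (samples x features).

    Illustrative only: generic power iteration with deflation,
    not the SVS implementation.
    """
    n, m = len(rows), len(rows[0])
    # Center each feature (column) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]

    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for _ in range(iters):
            # Deflation: remove the part of v along axes already found.
            for a in axes:
                dot = sum(vi * ai for vi, ai in zip(v, a))
                v = [vi - dot * ai for vi, ai in zip(v, a)]
            # Apply X^T (X v) without forming the covariance matrix.
            scores = [sum(xi * vi for xi, vi in zip(row, v)) for row in x]
            w = [sum(scores[i] * x[i][j] for i in range(n)) for j in range(m)]
            norm = math.sqrt(sum(wj * wj for wj in w))
            if norm == 0.0:
                raise ValueError("fewer than k directions carry variance")
            v = [wj / norm for wj in w]
        axes.append(v)
    return axes
```

Calling `leading_components(data, k=2)` returns two unit-length axes. Each iteration costs only two passes over the data, which is why methods of this kind scale to sample counts where a full eigendecomposition becomes impractical.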
Add-on Script Improvements
- The Add-on Script Calculate Pseudo Lambda has been replaced by the new script, Calculate Approximate Lambda from P Values, which follows more standard practice. Please use the new script if you wish to obtain an approximate (Genomic-Control) lambda value from a spreadsheet column of p-values for a test for which neither chi-squared values nor a lambda value has been output.
- A new Add-on Script, Convert Real Columns to Single Precision, has been implemented. This script recodes the double-precision data in the “Real” columns in your spreadsheet into single-precision data. The script executes immediately after you select it. A new spreadsheet will be generated which will be the same as the original spreadsheet except for the “Real” columns being encoded in single precision. (Note: If a number is larger in magnitude than can be represented in single precision, it will be designated with an “inf” or “-inf”.)
- A new Add-on Script, MLM with Multiple Phenotypes, runs the equivalent of Spreadsheet > Genotype > Mixed Linear Model Analysis on multiple dependent phenotype columns, one column at a time. All possible Mixed Linear Model Analysis tests can be run on each of your dependent columns, which may be binary, integer-valued or real-valued columns.
- A new Add-on Script, Row Averages for Active Numeric Columns, has been implemented. This script is not limited to a set of columns which must be hand-selected in a dialog, but will instead simply use all active numeric columns as input to its averaging. Upon finishing, this script will produce a histogram of the row averages (as does the Add-on Script Row Averages with Histogram).
- The Add-on Script Consecutive Numeric Regression Analysis will now (1) complete without hanging up SVS itself, (2) no longer stop and claim user cancellation when regressing on a numeric dependent variable using covariate(s), (3) show the outputs in the original dependent variable order, and (4) show every regression’s dependent variable in the final log.
- The Add-on Script Tall Skinny Import now has an extra checkbox asking whether data for each individual marker is clumped together or data for each individual sample is clumped together. The script uses that choice to adjust the prompts and to decide whether or not to transpose the output spreadsheet as a final step.
- The Add-on Script Row Averages with Histogram will now (1) average the columns that have been selected for averaging, rather than using columns that are one column to the right of what were selected, and (2) no longer crash if some columns have been inactivated. Also, performance of this script has been optimized.
- The Target Region CNV algorithm has been upgraded with new options, filters and quality flags for the NGS Exome use case. For more details see our NGS Exome CNV Calling tutorial and the documentation CNV Caller on Target Regions.
- The shipped RefSeq gene tracks for GRCh37 and GRCh38 have been updated with the latest gene names and aliases. Similarly, the default clinically relevant transcript in humans has been updated to reflect the latest data from MANE transcripts and ClinVar Assessments.
- Some RefSeq transcript alignments to the human genome have to compensate for differences between the reference sequence and the canonical RNA sequence. When these differences are one or two missing or additional bases, the transcript alignment introduces a one- or two-base “intron” (alignment gap). The transcript annotation algorithm will no longer classify variants in these gapped introns as canonical splice site mutations, but simply as splice region variants. As a result, fewer benign polymorphisms are incorrectly classified as loss-of-function variants.
- The HGVS output for variants before and after transcript sequences has been improved to use the “dup” syntax for insertions. For example: NR_003051:n.-22_-15dupTACTCTGT.
- The default cache and analysis memory settings have been updated to 2GB from 1GB, but previously saved settings are respected. Go to Tools > Global Product Options and click Restore Defaults to update the Memory Usage settings to the new defaults.
- The built-in downloader for SVS has been updated to support more firewall and proxy settings, to resume downloads more reliably when network connections are lost, and to better detect in-progress downloads started by another SVS instance on the same computer or accessing the same shared network drive for annotations.
- SVS now runs on Mac with the “Dark Mode” setting enabled. Previously, the setting caused a mix of inverted and regular colors. While SVS does not have a separate dark color palette, it now will consistently use the default colors and remain usable in dark mode.
- Attempting to import a VCF file with a detected genome assembly different than that of the selected assembly for the project template will produce a warning on the select sources screen and prompt with a final warning message when completing the wizard.
- When inputting terms into the PhoRank dialog as HPO terms (such as HP:0001637), you can now see the detected HPO term name by hovering over the input before running the algorithm.
- For Genotype > Compute GBLUP Using Bins (and sometimes Genotype > Compute Genomic BLUP (GBLUP)), convergence has been improved, and in the case that convergence still has not taken place, the behavior has been made more helpful.
- For Genotype > Mixed Linear Model Analysis and Genotype > Mixed Linear Model Analysis with Interactions, documentation for calculating the Beta Standard Error has been corrected.
- Documentation in the Formulas and Theories section for the Beta Standard Error for a covariate has been augmented to demonstrate two possible approaches to calculating this parameter: one of these approaches is used by Genotype > Genotypic Regression Analysis and Numeric > Numeric Regression Analysis, and the other by Genotype > Mixed Linear Model Analysis and Genotype > Mixed Linear Model Analysis with Interactions.
- The following changes are to the Multi-locus mixed model GWAS (MLMM) feature within Spreadsheet > Genotype > Mixed Linear Model Analysis:
- The Proportion of Variance Explained (PVE) is now output for Multi-Locus Mixed Model (MLMM) “Cofactor” (covariate) markers (in addition to being output for other markers in the p-value spreadsheet).
- EMMAX scans are now performed for all MLMM Backward-Elimination steps that eliminate markers in a different order from that in which Forward Selection added them. This is done because these Backward-Elimination-generated models are different from any previous models generated in the same MLMM run.
- Improved handling of plotting mitochondrial chromosome variants regardless of “M” or “MT” being used as the chromosome name.
- Newly created BAM plots will now have the “Filter Duplicate Alignments” option turned on, to match the coverage statistics algorithms used in CNV calling.
- You can now run Genotype > K-Fold Cross Validation (for Genomic Prediction) and use one or both of the Bayes methods, Bayes C-pi and Bayes C, without also specifying Correct for Gender. In SVS 8.9.0, this would error out.
- For Spreadsheet > File > Save As… > Variant Call Format (VCF), when the (optional) alternate allele field was specified and a marker contained an allele that was not listed in that marker’s alternate allele(s) field, the whole export process would error out. Now, in the same situation, the error will be noted in the node change log, the marker will be skipped for export, and the exporting process will finish.
- Using either numeric or genotypic regression from a script or the Python window would make SVS itself hang up if the dependent variable had missing values. This has now been fixed.
- Performing Genotype > Genotype Statistics by Sample will now work if there is one dependent variable which is numeric. If this numeric dependent variable has missing values, a second output spreadsheet which divides the samples into two categories, non-missing dependent value vs. missing dependent value, will be created. This feature has always been implemented for scripts and the Python window, but previously, performing Statistics by Sample interactively with a numeric dependent variable would simply get an error message and no output.
- Haplotype Block Detection can now be performed from a script or the Python window independently of whether or not any columns have been made dependent, and of what type those dependent columns are. Previously, the script/Python window version of this feature demanded either a case/control dependent variable (which was not really used) or no dependent column at all.
- Genotype > Compute Genomic BLUP (GBLUP) using Overall Normalization will now still function even if all markers turn out to be monomorphic.
- In the Node Change Logs for the results of Genotype > Compute GBLUP Using Bins, the list titled “Variance per marker by bin” actually consisted of pseudo-heritability per marker by bin for each bin, with erroneous entries for the standard error of these values. Now, both of these lists are displayed using the proper header for each list and correct standard error values.
- The (script) feature Spreadsheet > Edit > Create Labels from Marker Map Field, which has had several issues, has been eliminated. Please use Edit > Recode > Rename Marker Mapped Labels instead. The latter feature, besides being reliable, allows you to specify “Chromosome:Position” as an additional choice for your column headers or row labels.
- For the Linear Regression feature of Spreadsheet > Genotype > Genotype Association Tests, which is available when the dependent variable is quantitative, there may have been certain rare use cases where a perfectly good regression was categorized as having failed, and other rare use cases where a failed regression was not caught. These have now been fixed.
- The following change refers to the feature Spreadsheet > Genotype > Mixed Linear Model Analysis, the feature Spreadsheet > Genotype > Mixed Linear Model Analysis with Interactions, and to the feature Spreadsheet > Genotype > Compute Genomic BLUP (GBLUP):
- For certain rare edge cases when the EMMA algorithm doesn’t converge, the “optimum delta” which was output was the delta corresponding to the best of several pre-computed likelihoods, but the likelihood corresponding to this best delta was previously not output. This affects the “log likelihood” column of the MLMM Step Information spreadsheet and the Log Likelihood Ratio Test in the node change log output for GBLUP (when EMMA is used).
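For readers who want to see what an approximate Genomic-Control lambda from p-values typically involves (as with the Calculate Approximate Lambda from P Values script listed under the Add-on script changes), here is a minimal sketch. It assumes the common convention of converting each p-value to a 1-degree-of-freedom chi-squared statistic and dividing the median by the expected median under the null (about 0.455). The script's exact formula is not stated in these notes, and the function name `approximate_lambda` is hypothetical.

```python
from statistics import NormalDist, median


def approximate_lambda(p_values):
    """Approximate genomic-control lambda from p-values (1 df assumed).

    Illustrative only; not the SVS script itself.
    """
    nd = NormalDist()
    # Keep only usable p-values; p = 0 has no finite chi-squared value.
    usable = [p for p in p_values if 0.0 < p <= 1.0]
    if not usable:
        raise ValueError("no p-values in the interval (0, 1]")
    # A 1-df chi-squared statistic is a squared standard normal, so the
    # statistic with upper-tail probability p is the squared p/2 quantile.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in usable]
    # Median of the 1-df chi-squared distribution under the null.
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median
```

Under this convention, uniformly distributed p-values give a lambda close to 1, and values well above 1 suggest inflation of the test statistics.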