Clinical Variant Analysis for Cancer – Applying AMP Guidelines to Analyze Somatic Variants
As described in my eBook “Genetic Testing for Cancer,” any bioinformatic pipeline for cancer ultimately calls variants based on the aligned reads that the sequencer generated. Variant calling is the process of reviewing a sequence alignment, typically in the form of a BAM file, to identify loci that differ from the reference genome. Single Nucleotide Variants (commonly called “SNVs” or “SNPs”) are the most common type of variation, followed by insertions and deletions (jointly referred to as “indels”) and Multiple Nucleotide Variants (“MNVs” or “substitutions”). It is also possible to detect Copy Number Variations (CNVs) and other structural variants such as inversions and translocations, although whole-genome sequence data is often required to identify these features accurately. Variant calls are typically stored in a Variant Call Format (“VCF”) file. The format is flexible, but it typically records the genotype at each genomic coordinate where a variant is observed, together with technical data such as read depth and quality scores.
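To make the VCF layout concrete, here is a minimal sketch that parses one data line by hand and labels the variant type. The `VariantRecord` class, the `parse_vcf_line` and `variant_type` helpers, and the example line are all hypothetical illustrations; the sketch assumes a single ALT allele and reads only the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). A real pipeline would normally rely on a dedicated VCF library instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: Optional[float]  # None when the caller wrote "." (missing)
    info: dict


def parse_vcf_line(line: str) -> VariantRecord:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        # INFO flags have no "=value" part; store them as True.
        info_dict[key] = value if value else True
    return VariantRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alt=alt,
        qual=None if qual == "." else float(qual),
        info=info_dict,
    )


def variant_type(rec: VariantRecord) -> str:
    """Label a record using the categories described in the text."""
    if len(rec.ref) == 1 and len(rec.alt) == 1:
        return "SNV"
    if len(rec.ref) == len(rec.alt):
        return "MNV"
    return "indel"


# Made-up example line (coordinates and values are not real data).
example = "chr1\t12345\t.\tT\tG\t812.0\tPASS\tDP=950;AF=0.04"
rec = parse_vcf_line(example)
print(variant_type(rec), rec.info["DP"])
```

Note that INFO values are kept as strings here; converting fields such as DP to numbers is left to the caller.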
In the context of cancer testing, sequence reads from both tumor tissue and matched normal tissue samples can be aligned and compared to detect somatic variants in cancer cells. Probabilistic techniques have been developed to estimate the likelihood that a variant is a somatic mutation (Roth, 2012; Larson, 2012), and machine learning has been used to train classifiers that detect somatic mutations (Ding, 2012).
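As a rough illustration of the tumor-versus-normal comparison, the sketch below applies a one-sided Fisher's exact test to the read counts at a single site. This is a deliberately simple stand-in and not the method of any of the papers cited above; the function name `somatic_p_value` and the read counts are hypothetical.

```python
from math import comb


def somatic_p_value(tumor_alt: int, tumor_ref: int,
                    normal_alt: int, normal_ref: int) -> float:
    """One-sided Fisher's exact test on a 2x2 table of read counts.

    Returns the probability of seeing at least `tumor_alt` variant reads
    in the tumor if variant reads were distributed between tumor and
    normal purely by chance (hypergeometric null model).
    """
    tumor_total = tumor_alt + tumor_ref
    alt_total = tumor_alt + normal_alt
    ref_total = tumor_ref + normal_ref
    grand_total = alt_total + ref_total

    denominator = comb(grand_total, tumor_total)
    upper = min(alt_total, tumor_total)
    # Sum the hypergeometric terms for k = tumor_alt .. upper as exact integers.
    numerator = sum(
        comb(alt_total, k) * comb(ref_total, tumor_total - k)
        for k in range(tumor_alt, upper + 1)
    )
    return numerator / denominator


# Hypothetical site: 12 of 200 tumor reads carry the variant, 0 of 200 normal reads do.
p = somatic_p_value(tumor_alt=12, tumor_ref=188, normal_alt=0, normal_ref=200)
print(f"p-value: {p:.2e}")
```

A small p-value suggests the variant is enriched in the tumor. A variant with similar support in both samples, as expected for a germline variant, gives a large p-value.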
In cancer testing, the variants of interest are somatic (occurring in the tumor cells, but not the normal germline cells of the patient). Because tumor cell populations are heterogeneous, mutations that occur in only 50% of tumor cells must be detected, with sensitivity down to 10% of tumor cells desirable (according to the guidelines of the American Society of Clinical Oncology; see Leighl, 2014). Biopsies taken for diagnosis and sequenced will also contain a mix of tumor and normal cells. The implication is that variants must be called when a somatic mutation is present in only a small fraction of the sequence reads at a given genomic locus (the allelic frequency), often as little as 1-5% of reads. When a biopsy of normal cells from the patient is taken alongside the tumor, algorithms can consider both together and use probabilistic techniques to detect somatic mutations (Ding, 2012). The performance and agreement of these algorithms can vary widely depending on the properties of the input data (Xu et al., 2014).
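The arithmetic behind these low allelic frequencies can be sketched as follows. The model is a simplifying assumption rather than something stated in the guidelines: it treats the mutation as heterozygous in a copy-number-neutral diploid region, so only one of the two alleles in each mutated cell carries the variant. The function names and sample numbers are hypothetical.

```python
def observed_allelic_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def expected_allelic_frequency(tumor_purity: float,
                               mutated_cell_fraction: float) -> float:
    """Expected allelic frequency under the assumed heterozygous, diploid model.

    tumor_purity: fraction of cells in the biopsy that are tumor cells.
    mutated_cell_fraction: fraction of tumor cells carrying the mutation.
    """
    for value in (tumor_purity, mutated_cell_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
    # Half of the alleles in each mutated cell carry the variant.
    return tumor_purity * mutated_cell_fraction / 2.0


# Hypothetical biopsy: 40% tumor cells, mutation present in 25% of them.
expected = expected_allelic_frequency(0.40, 0.25)
observed = observed_allelic_frequency(alt_reads=38, total_reads=950)
print(f"expected {expected:.3f}, observed {observed:.3f}")
```

Under these assumptions, a mutation carried by a quarter of the tumor cells in a 40%-pure biopsy is expected in about 5% of reads, which is why callers need sensitivity in the low single-digit percent range.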
Cancer panel tests are often performed only on tumor samples (without matched-normal samples sequenced), in which case determining which variants are somatic mutations is done in the filtering and annotation process.
Clinical cancer analysis generally focuses on somatic mutations found in tumors. Novel somatic mutations are typically identified by comparing tumor sequences to a matched normal sequence. However, it is increasingly common to sequence only the tumor tissue when performing gene panel tests. The genes tested by standard cancer panels are well characterized, and certain mutations within those genes are known to occur frequently in certain cancers. Mutations at the same sites are extremely rare in non-cancerous tissues. Analysis of tumor sequences in the absence of a matched normal can therefore identify known cancer-associated mutations, although this approach lacks the power to distinguish between somatic mutations and germline variants at novel sites.
Filtering and Annotation
After calling variants, care must be taken to confirm the quality of the sequencing, review the identified variants, annotate them according to known or predicted consequences, and identify which variants, if any, may be clinically actionable.
Several public and proprietary databases contain information about previously observed somatic mutations in tumors. One of the most popular is COSMIC (the Catalogue Of Somatic Mutations In Cancer), a publicly accessible database of mutations extracted from research articles and from The Cancer Genome Atlas (TCGA). COSMIC catalogs many important data points about each mutation, including the tumor location, histology, and links to original publications and case reports.
A typical annotation and filtering workflow might include the following steps:
- Annotate all variants based on gene location and resulting changes to the protein product of the gene.
- Compare the variant list with COSMIC and retrieve annotations for any matching mutations.
- Flag variants with poor sequencing quality (based on coverage depth or other metrics).
- Identify somatic mutations. If normal tissue is available, compare against it to remove germline variants. Otherwise, remove common germline variants reported in population databases such as the 1000 Genomes Project, and focus on variants observed with an allelic frequency in the expected range for somatic mutations (for example, variants that appear in 1% to 15% of reads at the site).
- Prepare a final list of probable somatic mutations, together with all available annotation data.
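The tumor-only branch of the workflow above can be sketched as a small classification function. The 1% to 15% allelic frequency window comes from the example in the list; the minimum depth of 100 reads and the 1% population-frequency cutoff are illustrative assumptions, and the `CalledVariant` fields and sample variants are hypothetical rather than drawn from any real database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CalledVariant:
    name: str
    depth: int                      # total reads covering the site
    alt_reads: int                  # reads supporting the variant
    population_af: Optional[float]  # None if absent from the population database
    in_cosmic: bool                 # True if a matching COSMIC entry was found


def classify(v: CalledVariant,
             min_depth: int = 100,             # assumed quality threshold
             max_population_af: float = 0.01,  # assumed "common variant" cutoff
             vaf_range: tuple = (0.01, 0.15)) -> str:
    """Apply the filtering steps in order and return a label."""
    if v.depth < min_depth:
        return "low_quality"
    if v.population_af is not None and v.population_af >= max_population_af:
        return "common_germline"
    vaf = v.alt_reads / v.depth
    if vaf_range[0] <= vaf <= vaf_range[1]:
        return "probable_somatic"
    return "needs_review"


# Made-up calls for illustration only.
calls = [
    CalledVariant("variant_A", depth=800, alt_reads=40, population_af=None, in_cosmic=True),
    CalledVariant("variant_B", depth=600, alt_reads=300, population_af=0.12, in_cosmic=False),
    CalledVariant("variant_C", depth=30, alt_reads=3, population_af=None, in_cosmic=False),
    CalledVariant("variant_D", depth=500, alt_reads=250, population_af=None, in_cosmic=False),
]

labels = {v.name: classify(v) for v in calls}
final_list = [v for v in calls if labels[v.name] == "probable_somatic"]
for v in final_list:
    print(v.name, "COSMIC match" if v.in_cosmic else "no COSMIC match")
```

Variants outside the allelic frequency window are labeled for review rather than discarded, since a high-purity sample or a copy-number change can push a true somatic mutation above 15%.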
Fig 2, Fig 3, Fig 4 and Fig 5 provide a comprehensive overview of key databases for the interpretation of somatic sequence variants.
Fig 2: Population database to exclude common variants