Updated: January 28, 2021
Packages: RNA-Seq, Power Seat
The following tutorial is designed to systematically introduce you to a number of techniques for analyzing your RNA-Seq or other high throughput sequencing data output within SVS. It is not meant to replicate all the workflows you might use in a complete analysis, but instead touch on a sampling of the more typical scenarios you may come across in your own studies.
To complete this tutorial you will need to download and unzip the following file, which includes a starter project for this tutorial:
Files included in the above ZIP file:
- Analyzing RNA Sequence Data Tutorial – Starter project containing isoforms and gene counts data.
The following tutorial is designed to systematically introduce you to a number of techniques for analyzing your RNA-Seq or other high throughput sequencing data within SVS. It is not meant to replicate all the workflows you might use in a complete analysis, but instead touch on a sampling of the more typical scenarios you may come across in your own studies.
Facets of analysis covered in this tutorial include:
- Sample Statistics and Upper Quartile Normalization
- Numeric Association and Visualization
- DESeq and Visualization
- BAM Visualization
The previously downloaded project contains two spreadsheets, Isoforms counts – Sheet 1 and Breast Cancer Subtypes + Gene Counts – Sheet 1. This data was obtained from RNA-seq on several breast cancer cell lines and was generated by the Golden Helix pipeline.
The datasets represent how many reads mapped to each gene or isoform (spreadsheet column) for each sample (spreadsheet row). However, only the Gene Counts spreadsheet will be used in this tutorial. The steps in this tutorial can be applied to any dataset that is compatible with SVS and was generated by a secondary analysis RNA-Seq pipeline. This tutorial also assumes that the count data has not yet been normalized. As you will see, this tutorial covers different normalization techniques, including one that is built-in to the DESeq Analysis tool.
The isoform and gene counts were generated by Golden Helix pipeline in .tsv format. These counts were then imported into SVS using Import RNA-Seq Tabularized Quantification in the project navigator Import menu. This import tool also recognizes marker map data included in the .tsv file and applies the marker map to the imported spreadsheet.
You can find an example of the format that is used by our Tabularized import tool in the SVS Manual.
If you are an SVS customer and have RNA-Seq data in a different format, contact our Bioinformatics Support Team and ask about how to import your data into SVS.
2. Sample Statistics and Normalization
RNA-Seq data generated as raw count data from RSEM contains proportional values as a result of reads that have the potential for multiple alignments to various transcripts found in the reference genome. This creates the possibility for read data to be non-uniquely mapped. SVS can assist in validating the read depth obtained on your samples and performing an upper quartile normalization step as outlined by Bullard et al. BMC Bioinformatics 2010.
A. Sample Statistics
First, examine the summary statistics of the read counts per sample. To do this, transpose the dataset and calculate column statistics on the transposed spreadsheet. Then create a side-by-side boxplot of the transposed columns and visually examine the sample distributions.
- Open Breast Cancer Subtypes + Gene Counts – Sheet 1. and select Edit >Transpose Spreadsheet. Click OK.
- From the transposed spreadsheet, choose Numeric > Statistics (per Column).
- Leave Minimum, First Quartile, Maximum, Third Quartile, Median, Sum and Outlier Thresholds checked. Uncheck all other output options (except the Integer columns and Real columns options inside the Include in calculations: box).
- The dialog should look like Figure 2-1. Click OK.
The resulting spreadsheet contains the selected statistics for each sample. The target depth had been set at 12.5 million reads, so the values in the Sum column should be close to this value.
- From Column Statistics, right-click on the Sum column header (8th column) and select Sort Descending.
This information displays the number of reads obtained for each sample. Sample BT549 has the closest approximation to the target read depth at 11,160,803.71 reads. The remaining reads for this sample were collected, but were not found in alignment to the reference genome.
Also notice that all the values in the Minimum and Q1 columns are 0. This not only demonstrates the skewness in the data but also the fact that the dataset is oversaturated with 0 values.
NOTE: By selecting additional options in the Column Statistics dialog, the mean and variance of the count data could also have been shown.
Next, visually compare the read depth distribution over all samples by creating a side-by-side boxplot.
- From the transposed spreadsheet Breast Cancer Subtypes + Gene Counts Transposed – Sheet 1, choose Plot > Columnwise Side-by-Side Boxplot.
Notice that one sample (HS578T) contains an extreme outlier compared to all other samples’ data. It will be important to keep this outlier in mind when reviewing the analysis results later in the tutorial.
This plot also shows that each sample has a skewed (non-normal) distribution. Later in the tutorial, you will apply a normalization procedure to this dataset before performing an association test. The other analysis method used, DESeq, has a built-in normalization procedure.
B. Minimum Read Threshold
As shown previously, the gene count data is overly saturated with 0 counts. To account for this we will be using a tool that allows you to require a minimum read count in a minimum number of samples.
- From Breast Cancer Subtypes + Gene Counts – Sheet 1, choose RNA-Seq > Activate Genes by Minimum Read Threshold
- Enter 5 in the first box and change the drop down value to occurrences. Enter 10 in the second box.
- The dialog should look like Figure 2-3. Click OK
14,804 gene counts should have passed the provided threshold values, with 14,807 columns remaining active. Create a subset of the spreadsheet.
- Choose Select > Column > Column Subset Spreadsheet
- Rename the subset to Minimum Read Threshold
C. Upper Quartile Normalization and Log Transform
Close all open spreadsheets and plots by selecting Window > Close All from the Project Navigator.
The next step is to normalize the data using an Upper-Quartile normalization and Log Transformation procedure available in SVS. The transformed dataset can then be used for association testing.
- Open the Minimum Read Threshold spreadsheet
- Select RNA-Seq > Normalization and Log Transformation.
- Leave the default settings as seen in Figure 2-4 and click OK.
This produces two output spreadsheets, Minimum Read Threshold – Upper Quartile Normalized and Minimum Read Threshold – Upper Quartile Normalized – log2 Transformed. The second spreadsheet has both procedures applied and should be used for further analysis.
Plot another side-by-side boxplot here to view and compare the normalized distributions.
- From Minimum Read Threshold – Upper Quartile Normalized – log2 Transformed, choose Edit > Transpose Spreadsheet.
- Select Current Spreadsheet after Spreadsheet as child of: and click OK.
- From the transposed spreadsheet choose Plot > Columnwise Side-by-Side Boxplots.
As evidenced in Figure 2-5, the normalization procedure has worked as expected and the distributions across all samples look considerably more comparable.
This breast cancer samples in this dataset comprise three subtypes, Basal, Claudin and Luminal. Next, perform PCA Analysis to examine and compare relationships between the three subtypes.
- From the Minimum Read Threshold – Upper Quartile Normalized – log2 Transformed spreadsheet, select Numeric > Numeric Principal Component Analysis.
- Click Yes when asked if you want to continue.
- After Data Centering, select the checkbox for Center data by marker, which in this case means to center by gene.
- The dialog should look like Figure 2-6. Click Run.
Two output spreadsheets are created, Principal Components (Center by Marker) and PC Eigenvalues (Center by Marker). The PC Eigenvalues (Center by Marker) spreadsheet lists the top 10 calculated Eigenvalues. Since there are three major subtypes, one would expect the top two or three eigenvalues to be able to differentiate the majority of the data. In other words, the eigenvalues should begin to stabilize after the first three, which seems to be the case.
Next, create an NxN PCA plot of the top four principal components. You would expect to lose differentiation by subtype in the fourth principal component and creating a 4×4 PCA plot should confirm this.
Before creating the plot, the subtype information needs to be joined with the Principal Components.
- Open Breast Cancer Subtypes + Gene Counts – Sheet 1.
- Choose Select > Column > Inactivate all columns. Then click on column 1 to reactivate the Type column. This will create spreadsheet Breast Cancer Subtypes + Gene Counts – Sheet 2, which will have one column active.
- Now select File > Join or Merge Spreadsheets.
- Choose Principal Components (Center by Marker) from the list.
- Enter Pheno + PCs as the New dataset name: and click OK.
Create the scatter plot from this spreadsheet.
- From Pheno + PCs – Sheet 1, select Plot > N x N Scatter Plot.
- Click Add Columns and select the first four columns (EV=10.4881, EV=4.58711, EV=2.99062, and EV=2.4402). Click OK.
- After Optionally choose grouping variable, click Select Column.
- Select Type and click OK. Then click OK to create the plot.
This produces PCA scatter plots, colored by sample type, of all combinations of the first four principal components taken two at a time (Figure 2-7). We see that the scatter plot between the first two principal components (shown in two different directions in the upper-left corner) is well-segregated by subtype.
Looking in further detail, we see that the first principal component exhibits good segregation by all subtypes, and that the second principal component is good at segregating Basal from the other two subtypes. The 3rd and 4th components do not really segregate any of the three subtypes at all.
NOTE: Upper quartile normalization and log transformation is an important step that is performed on the raw count data from RNA-Seq output created by the RSEM pipeline. Data imported from other RNA-Seq pipelines such as FPM is also supported and requires upper quartile normalization as well. However, RNA-Seq FPKM data produced from Top Hat has already been normalized and only requires a log2 transformation of the data as well as PCA.
3. Numeric Association and P-P Plots
Next, perform a numeric association test on the transformed dataset. This type of test requires a binary dependent column–thus, only two subtypes can be tested per run. The second and third columns in the dataset contain binary expansions of the Claudin and Luminal subtypes.
A. Numeric Association Test
- Open the Minimum Read Threshold – Upper Quartile Normalized – log2 Transformed spreadsheet.
- Right-click on the first column header (Type) and select Activate by Category.
- Select the Claudin and Luminal categories (and deselect Basal) by first clicking on Claudin, then, while holding down the Ctrl key, clicking on Luminal.
- The dialog should look like Figure 3-1. Click OK.
This will inactivate the samples with Basal subtype, allowing the association test to compare only the (active) rows corresponding to the Claudin and Luminal subtypes.
- Make Type=Luminal? the dependent column by clicking once on its column name header. The column will turn magenta.
- Select Numeric > Numeric Association Tests.
- Click Yes when asked if you would like to continue.
- Match the settings to Figure 3-2 by checking T-test.
- Click Run.
B. Visualizing Results
Visualize the log-10 P-value by creating a numeric plot.
- From the Association Tests spreadsheet, right-click on the T Test -log10 P column header and select Plot Variable in GenomeBrowse.
From the association results, you could mark the genes with the largest -log10 P-values for further investigation. The next step in this tutorial will lead you through an analysis method specifically designed for RNA-Seq data.
4. DESeq Analysis and Visualization
DESeq is an analysis tool for analyzing variance in numerical count data produced from high throughput analysis tools, such as RNA-Seq. These analysis techniques were first published in a paper from Anders & Huber, 2010.
This technique provides an internal normalization technique and therefore the test should be performed on the original Minimum Read Threshold spreadsheet that has not yet been normalized.
A. DESeq Variance Analysis
- Open Minimum Read Threshold.
- Click on the Type column name header, which turns the column magenta and designates the column as the dependent variable.
- Select RNA-Seq > DESeq Analysis.
- In the first tab, select the analysis options that are shown in Figure 4-1. The options in the second tab will remain the default options.
- Click OK.
Two output spreadsheet are created, P-values for Conditions A (Claudin) and B (Luminal) and VST-Transformed Counts for G/T with the Best 50 P-values. The first spreadsheet contains the analysis results while the second is an intermediate spreadsheet that will be used for plotting purposes.
B. Plotting Variance Results
Calling Differential Expression
- Open the spreadsheet P-Values for Conditions A (Claudin) and B (Luminal).
- Select Plot > XY Scatter Plots.
- In the left column, select Log 2 (Combined Base Mean) as the Independent Axis (X), in the right column select Log 2 Fold Change as the Dependent Axis (Y).
- Click Plot.
Now color the plot based on a p-value threshold.
- In the Graph Control Interface, click on the Log 2 Fold Change node and select the Color tab below.
- Click the radio button By Variable and then click the Select Variable… button.
- Choose FDR-corrected p-value (fourth down), select < in the bottom right and type 0.001.
- Click OK.
The image should appear much like Figure 4-2. The blue color indicates p-values less than 0.001 after they are adjusted for multiple testing with the Benjamini-Hochberg false discovery rate (FDR) procedure.
This plot demonstrates how the overall counts (combined base mean) affect the amount of fold change needed to detect differential expression. As the overall counts become larger, the surrounding noise decreases, making smaller fold changes significant. On the other hand, the surrounding noise around smaller overall counts causes them to require large fold changes to be significant.
- Open P-Values for Conditions A (Claudin) and B (Luminal).
- Select Plot > XY Scatter Plots.
- In the left column, select Log 2 Fold Change as the Independent Axis (X) and in the right column select -Log 10 P-Value the Dependent Axis (Y).
- Click Plot.
- In the Graph Control Interface, select the -Log10 P-Value node and click on the Color tab below.
- Click the By Variable radio button and then click the Select Variable… button.
- Choose FDR-corrected p-value, select > in the bottom right, type 1E-6, and click OK.
The plot should appear much like Figure 4-3.
Dendrograms and Heat Map
- Open VST-Transformed Counts for G/T with the Best 50 P-Values.
- Select RNA-Seq > Dendrograms and Heatmap.
- Leave the default options as they are and click OK.
The plot (Figure 4-4) shows two well-segregated groups that correspond to Claudin and Luminal subtypes. In the heatmap, blue represents low levels of expression while red represents high levels.
Plotting P-values in genomic space
- Open P-Values for Conditions A (Claudin) and B (Luminal) and right-click on the -Log 10 P-Value column header. Select Sort Descending.
This brings the most interesting potentially predictor genes to the top of the list. LEPREL1 is indicated as the top gene on this list.
- Right-click on the column header for the -Log10 P-Value column.
- Select Plot Variable in GenomeBrowse.
This provides a visual display of the -log 10 P values in genomic space. The chromosomes and genes are displayed on the bottom X axis and the -log 10 P values on the Y axis.
- In the Plot Tree select either -Log 10 P-Value graph node. Then select the Style tab in the Controls window.
- Select Chromosome from the drop down menu under Style By:.
This creates a Manhattan Plot where each chromosome is a different color (Figure 4-5).
- Click on the uppermost yellow-orange dot in Chromosome 3.
The Console in the lower-left corner should now show the name LEPREL1, which corresponds to the best predictor gene shown in the spreadsheet P-Values for Conditions A (Claudin) and B (Luminal).
- Zoom in on this gene by selecting the magnifying glass on the zoom toolbar, then pressing your left mouse button to the left of this gene and while holding down your left mouse button dragging the cursor across to the right of this gene, finally releasing the mouse button to complete the zoom. Repeat this process until the dot representing the gene not only has a label, but has become a horizontal oval taking up at least a quarter of the horizontal distance across the plot.
Notice (Figure 4-6) that according to the RefSeq Genes 105 Interim v1, NCBI track in the bottom plot, the genetic coordinates taken up by “LEPREL1” are instead supposed to be gene P3H2. What is going on? Where is LEPREL1?
NOTE: You may want to vertically expand the bottom plot, RefSeq Genes 105 Interim v1, NCBI, in order to be able to see gene P3H2, or to assure yourself that you haven’t missed anything in the bottom plot.
- Now hover the mouse over the P3H2 gene in the bottom plot until a box forms around its name, then click. Information about gene P3H2 should appear in the Data Console, which is in the lower-left corner of the plot.
- Vertically expand the Data Console box, if necessary, then scroll the Data Console down to the GeneID field (near the bottom).
- Click on the 55214 Gene ID.
You will now be brought to an NCBI web page about gene P3H2. Note that this gene is also known as “MCVD; MLAT4; LEPREL1”. (Mystery solved!)
Several other links to databases are available from the information in the Data Console about gene P3H2 (LEPREL1), or about any other gene on whose name you click in the gene track.
NOTE: If the name of a gene in your RNA-Seq data is completely current, you may zoom in on that gene simply by entering its name into the Genomic Location Bar (at the top of the plot window above the Full Domain Band).
5. Visualizing Expression Data
Now to add the BAM files to our P-Value plot for visualization. The view should still be zoomed into the area covering the P3H2 (LEPREL1) gene, which was found to have a very small p-value.
- From the plot we have just been viewing (Plot of Column -Log10 P-Value from P-Values for Conditions A (Claudin) and B (Luminal)), do the following:
- Click the Plot button and choose Example Samples in the left-most pane. Most of these samples correspond to this Breast Cancer dataset. You will also see a few DNA-Seq files that we host on our server in this list.
- Check a couple of Luminal samples, such as HCC1428_12.5M and MCF7_12.5M, then click Plot.
- Now check a couple of Claudin samples, such as BT549_12.5M and HS578T_12.5M, then click Plot & Close.
The first two BAM files correspond to the Claudin group and show high levels of expressions for the P3H2 (LEPREL1) gene. The 3rd and 4th BAM files correspond to the Luminal group and show low levels of expression for the gene.
This discrepancy in expression levels between the breast cancer subtypes follows our expectations based on the small DESeq p-value.