Welcome to the Family-Based Analysis Using SVS PBAT Tutorial!
Updated: January 28, 2021
Packages: PBAT Analysis
This tutorial leads you through family-based association analysis using the PBAT statistical package incorporated into SNP & Variation Suite (SVS). Covered workflows include data preparation, quality assurance testing, association analysis, and basic visualization of results.
Golden Helix SVS PBAT is developed in collaboration with Dr. Christoph Lange of Harvard University’s School of Public Health.
NOTE: The data used in this tutorial is for demonstration purposes only as it consists of simulated phenotypic information for the CEU HapMap samples.
To complete this tutorial you will need to download and unzip the tutorial data file, which includes several datasets.
Files included in the ZIP file:
- CEU – PED.csv – Actual pedigree information for the CEU HapMap samples (Phase III).
- CEU – SIM – PHENO.csv – Simulated phenotype and clinical data.
- CEU – GENO – Chr22.dsf – Actual chromosome 22 genotypes for the CEU HapMap samples (Phase III) generated from a combination of Affymetrix and Illumina arrays.
We hope you enjoy the experience and look forward to your feedback.
1. Data Preparation
In order to run PBAT in SVS you need, at minimum, a spreadsheet containing pedigree information (including Family ID, Patient ID, Mother ID, Father ID, Sex, and Affection Status) and genetic data (either genotypes or continuous variables, such as log ratios). If there is phenotype data in addition to (or instead of) the Affection Status, you first need to join it with your pedigree and genetic data in order for Golden Helix SVS PBAT to be able to access it. The following steps lead you through importing each data type separately and then merging this data into a single spreadsheet.
A. Import Pedigree Information
Before you can begin you need to create a new project.
- Open SVS and from the Welcome Screen select File > New Project.
- Name the project PBAT Tutorial, browse to a directory where you want the project saved, keep the default genome assembly Homo sapiens (Human) GRCh37 (hg19) (Feb 2009), and click OK. This will open the Project Navigator.
The first file to import is CEU – PED.csv contained within the downloaded zip file. This is a comma-delimited CSV file with pedigree information for the CEU HapMap samples (Phase III).
- Select Import > Family Pedigree > Text Pedigree.
- Browse to the directory where you saved CEU – PED.csv, select CEU – PED.csv, and click Open.
- Under Row Labels select Use column number: 1.
- Choose the Sex is encoded as 0/1/2 (or ?/1/2) radio button.
- Choose the Affection Status is encoded as 0/1/2 (or ?/1/2) radio button.
NOTE: If the default options (?/0/1) are used for encoding Sex and Affection Status, the resulting spreadsheet will not be recognized as a pedigree spreadsheet.
- Click OK.
This will create a new pedigree spreadsheet called CEU – PED Pedigree Dataset – Sheet 1 (Figure 1-1).
NOTE: Pedigree spreadsheets are denoted as such by a pedigree icon in the Project Navigator as well as blue headers for the pedigree columns at the front of the spreadsheet. If your imported spreadsheet has neither of these, it has not been recognized as a pedigree spreadsheet, and so certain analysis options will not be available.
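The encoding requirement described above can be checked before import. The following is a minimal Python sketch and not part of SVS: the column names, their order, the sample rows, and the helper name `check_pedigree_rows` are all assumptions made for illustration, and the real CEU – PED.csv layout may differ.

```python
import csv
import io

# Hypothetical pedigree rows; the real file's header names and order may differ.
SAMPLE_PED = """Family ID,Patient ID,Mother ID,Father ID,Sex,Affection Status
F1,P1,0,0,2,1
F1,P2,0,0,1,1
F1,P3,P1,P2,1,2
F2,P4,0,0,3,1
"""

# Values accepted under the 0/1/2 (or ?/1/2) encoding chosen in the import dialog.
ALLOWED = {"0", "?", "1", "2"}


def check_pedigree_rows(text):
    """Return (patient_id, column, value) for each Sex/Affection value outside ALLOWED."""
    problems = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in ("Sex", "Affection Status"):
            if row[column] not in ALLOWED:
                problems.append((row["Patient ID"], column, row[column]))
    return problems


print(check_pedigree_rows(SAMPLE_PED))
```

In this made-up sample only the last row is reported, because its Sex value falls outside the expected encoding.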
B. Import Phenotype Information
Next you need to import CEU – SIM – PHENO.csv. This is a comma-delimited CSV file with simulated phenotype information. It is used for demonstration purposes only.
- From the Project Navigator select Import > Text.
- Browse to the directory where you saved CEU – SIM – PHENO.csv, select CEU – SIM – PHENO.csv, and click Open.
- Leave the rest of the parameters as defaults and click OK.
This will create a new spreadsheet called CEU – SIM – PHENO – Dataset – Sheet 1 (Figure 1-2).
C. Import Genotypes
Last, you need to import CEU – GENO – Chr22.dsf. This file contains actual genotypes on chromosome 22 for the CEU samples, which were generated by a combination of Affymetrix and Illumina platforms.
- From the Project Navigator select Import > Golden Helix DSF.
- Browse to the directory where you saved CEU – GENO – Chr22.dsf, select CEU – GENO – Chr22.dsf, and click Open.
This will create a new marker-mapped spreadsheet called CEU – GENO – Chr22 – Sheet 1 (Figure 1-3).
D. Merge Spreadsheets
Now that you have all three spreadsheets in the project you need to join them together. When joining spreadsheets it doesn’t matter which one you start from. However, if there is certain data you want located toward the front of your spreadsheet for easier viewing (e.g. phenotype data) you will want to initiate the join from that spreadsheet. When pedigree data is available (and denoted as such) this information will always be the first six columns of the spreadsheet.
- Open CEU – PED Pedigree Dataset – Sheet 1 and select File > Join or Merge Spreadsheets.
- From the spreadsheet chooser select CEU – SIM – PHENO – Dataset – Sheet 1 and click OK.
- Enter PED + PHENO for New dataset name:.
- Under Spreadsheet as Child of choose Current Spreadsheet.
- Leave all other parameters as the defaults and click OK.
This will create a new spreadsheet PED + PHENO – Sheet 1. Now join this one with the genotype spreadsheet.
- From PED + PHENO – Sheet 1 select File > Join or Merge Spreadsheets.
- Select CEU – GENO – Chr22 – Sheet 1 and click OK.
- Enter CEU All for New dataset name:.
- Under Spreadsheet as Child of choose Project root.
- Leave all other parameters as the defaults and click OK.
You now have all the data in one spreadsheet, CEU All – Sheet 1, and are ready for analysis.
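Conceptually, the two joins above match rows by sample ID and place the columns of the starting spreadsheet first. The following Python sketch illustrates that idea only. It assumes an inner join keyed on the row label, which may not match the SVS default join behaviour, and the sample IDs, the marker name `rs_demo_1`, and the helper `join_by_sample` are hypothetical.

```python
# Hypothetical miniature versions of the three spreadsheets, keyed by sample ID.
pedigree = {
    "S1": {"Family ID": "F1", "Sex": 1},
    "S2": {"Family ID": "F1", "Sex": 2},
    "S3": {"Family ID": "F2", "Sex": 1},
}
phenotype = {
    "S1": {"Age": 42},
    "S2": {"Age": 39},
}
genotypes = {
    "S1": {"rs_demo_1": "A_A"},
    "S2": {"rs_demo_1": "A_B"},
    "S4": {"rs_demo_1": "B_B"},
}


def join_by_sample(left, right):
    """Inner join: keep samples present in both tables, with left columns first."""
    return {sid: {**left[sid], **right[sid]} for sid in left if sid in right}


# Mirror the tutorial's order: pedigree + phenotype first, then genotypes.
ped_pheno = join_by_sample(pedigree, phenotype)
ceu_all = join_by_sample(ped_pheno, genotypes)

for sample_id, columns in ceu_all.items():
    print(sample_id, columns)
```

Because the join starts from the pedigree table, the pedigree columns come first in each joined row, followed by the phenotype and then the genotype columns.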
CNV Analysis Through SVS PBAT
In addition to performing family-based association testing using genotypes as covariates, you can also perform association with various CNV covariates. Though not covered in this tutorial, you would go about PBAT CNV Analysis in the same manner as PBAT Genotype Analysis, except that instead of joining a genotype spreadsheet with your pedigree and phenotype information, you would join your CNV data. To learn more about processing microarray CNV data, see the SVS Microarray CNV Univariate Analysis Tutorial.
2. Quality Assurance
SVS provides a number of quality control metrics for identifying poor quality SNPs and samples, some of which are family-based and make use of SVS PBAT. This tutorial focuses specifically on PBAT Family-Based QA, which enables the detection of Mendelian errors and samples with overall poor genotype quality.
NOTE: Though not covered in this tutorial, it is still appropriate to apply other non-family-based quality assurance metrics to exclude poor quality samples and markers from analysis. Several additional options are available under the Genotype > Quality Assurance and Utilities spreadsheet menu. For more information about these options, see the Genotype Data Quality Assessment and Utilities section of the SVS Manual.
A. Quality Assurance by Marker
- Open CEU All – Sheet 1 and select Genotype > PBAT Family-Based QA.
- Under Computation parameters check Use alternative rapid pedigree algorithm. This option needs to be checked in order for PBAT to report Mendelian errors.
- Under Output choose Output by marker.
- Leave all other parameters as the defaults and click Run.
Upon completion a new spreadsheet is created, PBAT QA Results (by Marker) (Figure 2-1), with various quality control statistics. In this tutorial we’ll focus on removing SNPs that have one or more Mendelian errors.
- Right-click the Mendelian errors column and select Activate by Threshold.
- Select <= 0 and click OK.
This will inactivate all the rows where there are Mendelian errors. We will use the active rows in this spreadsheet to activate their respective columns in the CEU All – Sheet 1 spreadsheet.
- From the PBAT QA Results (by Marker) spreadsheet go to Select > Apply Current Selection to Second Spreadsheet.
- Choose to apply filtered rows to CEU All – Sheet 1, then click OK.
This will create a new spreadsheet, CEU All – Sheet 2, with 19,090 active columns. This tool will also inactivate the pedigree and phenotype columns. To reactivate them, left-click once on the Family ID column header, then, while holding down the Shift key, click on the Age phenotype column header.
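To make the marker filter concrete, the following Python sketch shows what a Mendelian error is for a single trio and how markers with zero errors are kept. It is an illustration under simplifying assumptions (autosomal markers, complete trios, no missing genotypes) and is not the algorithm PBAT uses. The marker names, genotypes, and the helper `mendelian_consistent` are hypothetical.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele of each parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


# Hypothetical trios per marker: (father, mother, child), each genotype a pair of alleles.
trios_by_marker = {
    "marker_1": [(("A", "A"), ("A", "B"), ("A", "B"))],
    "marker_2": [(("A", "A"), ("A", "A"), ("A", "B"))],  # child carries B, neither parent does
    "marker_3": [(("A", "B"), ("B", "B"), ("B", "B"))],
}

# Count inconsistent trios for each marker.
error_counts = {
    marker: sum(not mendelian_consistent(*trio) for trio in trios)
    for marker, trios in trios_by_marker.items()
}

# Same rule as the "<= 0" threshold above: keep markers with no Mendelian errors.
kept_markers = [marker for marker, count in error_counts.items() if count <= 0]

print(error_counts)
print(kept_markers)
```

Here `marker_2` is dropped because the child's B allele cannot have come from either parent.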
B. Quality Assurance by Sample
PBAT incorporates a novel test that assesses the genotyping quality of individual probands in family-based association studies. As described by Fardo et al. in PLoS Genetics (2009), tests of this kind are “ideally suited as the final layer of quality assurance filters in the cleaning process of genome-wide association studies.”
- Open CEU All – Sheet 2 and select Genotype > PBAT Family-Based QA.
- Again, check Use alternative rapid pedigree algorithm under Computation parameters.
- This time select Output by proband under Output and click Run.
Another new spreadsheet is created, PBAT QA Results (by Proband) (Figure 2-2), this time with quality control metrics for each proband. In the paper cited above, Fardo et al. suggest that, on a genome-wide scale, probands with a score greater than 30 are considered to have poor genotyping quality.
- Right-click on the Tgw column header and select Sort Descending.
Notice there are 5 samples with a Tgw value greater than 30. However, this particular dataset only contains genotypes for chromosome 22, so the statistics reported do not necessarily translate to a whole genome scale. Therefore, for this tutorial we will not exclude any samples.
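The sort-and-inspect step can be sketched in Python as follows. The proband names and scores are invented for illustration and are not values from the tutorial dataset. The cutoff of 30 is the genome-wide guideline mentioned above, which, as noted, does not necessarily carry over to a single chromosome.

```python
# Genome-wide guideline from the text; may not apply to single-chromosome data.
GENOME_WIDE_CUTOFF = 30

# Hypothetical Tgw scores, not taken from the tutorial results.
tgw_scores = {
    "proband_a": 12.4,
    "proband_b": 41.7,
    "proband_c": 30.0,
    "proband_d": 33.1,
}

# Equivalent of "Sort Descending" on the Tgw column.
ranked = sorted(tgw_scores.items(), key=lambda item: item[1], reverse=True)

# Flag probands strictly above the cutoff for review rather than automatic removal.
flagged = [name for name, score in ranked if score > GENOME_WIDE_CUTOFF]

print(ranked)
print(flagged)
```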
3. Association Analysis
Now that important quality control metrics have been considered, you’re ready to run PBAT analysis on the remaining samples and SNPs. There are many different configurations of association tests and parameters one could run in PBAT. This tutorial covers a basic workflow. For more detailed information on the various options please reference the PBAT Family-Based Analysis section of the SVS Manual.
A. Run PBAT Genotype Analysis
- Open the CEU All – Sheet 2 spreadsheet and select Genotype > PBAT Genotype Analysis.
This will open the PBAT Genotype Analysis window. The first window enables you to select various phenotypes, predictor variables, interactions, and more for analyses. For this tutorial we will only consider Affection Status.
- Select Affection Status in the upper-left box on the Select Phenotypes tab.
- Click the Test Statistic and Computational tab.
- Check Output -log 10 p-values under Output Format.
- Leave all other parameters as defaults and click Run.
Upon completion a results spreadsheet, PBAT Results, is created (Figure 3-1). This spreadsheet reports a number of statistics, of which -log10 pvalue(FBAT) and power(FBAT) are of greatest interest. For a complete description of these and the other statistics reported, please see the PBAT Family-Based Analysis section of the SVS Manual.
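For readers unfamiliar with the -log10 scale, the short Python sketch below shows the transformation applied to a few made-up p-values. Smaller p-values map to larger values, which is why strongly associated markers appear as peaks when plotted. The helper name `neg_log10` and the example p-values are illustrative assumptions, not output from PBAT.

```python
import math


def neg_log10(p_value):
    """Convert a p-value in (0, 1] to the -log10 scale."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    return -math.log10(p_value)


# Hypothetical p-values, not results from the tutorial dataset.
example_p_values = [0.05, 0.001, 1e-6]

for p in example_p_values:
    print(p, round(neg_log10(p), 3))
```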
B. Plot Results
We will examine both the -log10 pvalue(FBAT) and power(FBAT) columns.
- From the PBAT Results spreadsheet, right-click on the -log10 pvalue(FBAT) column and select Plot Variable in GenomeBrowse.
- Zoom into Chromosome 22 by copying and pasting 22: 13,501,202 – 51,304,566 into the address bar at the top of the GenomeBrowse window.
This opens the plot viewer with -log10 pvalues displayed according to chromosome and position (Figure 3-2). You can add additional plots to this view from the User Graphs node in the Graph Control Interface.
- Go to File > Plot and click the Project button. Select the PBAT Results spreadsheet, check the power(FBAT) item, and click Plot & Close.
You should now have two graphs in the plot viewer.