PBAT CNV Analysis

9.5 PBAT CNV Analysis

Summary

PBAT also supports testing of copy-number variation (CNV) data in a family-based setting [Ionita-Laza 2007].

The normal FBAT statistic is based on the coded genotypes of the family members being tested at each locus, and these codings depend on the genetic model under consideration. The CNV FBAT statistic, by contrast, is based directly on the intensity values, or rather on numbers derived from them, such as log2 ratios. These intensity-derived values are used in place of the coded genotypes, so this approach bypasses the need for a CNV genotyping algorithm to analyze CNV data.

To obtain the expected intensity value for an offspring, the intensity values of the respective parents are averaged. If the parental information is missing, the intensity values of the siblings are averaged. (This is in place of finding an expected genotypic coding based on the genotypes of the parents or the genotypes of the siblings.)
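The averaging rule can be sketched in a few lines of Python. This is only an illustration, not the PBAT implementation: the function name is hypothetical, and it assumes that "missing parental information" means that at least one parental value is absent and that the sibling list holds whatever sibling values are available.

```python
from statistics import mean
from typing import Optional, Sequence


def expected_intensity(parent_values: Sequence[Optional[float]],
                       sibling_values: Sequence[float]) -> Optional[float]:
    """Expected intensity-derived value for one offspring at one marker."""
    # Assumption: the parental average is used only when both values are present.
    if len(parent_values) == 2 and all(v is not None for v in parent_values):
        return mean(parent_values)
    # Otherwise fall back to the average of the available sibling values.
    if sibling_values:
        return mean(sibling_values)
    return None  # nothing to average


both_parents = expected_intensity([0.10, -0.30], [0.05, 0.15])
one_parent_missing = expected_intensity([None, -0.30], [0.05, 0.15])
print(both_parents, one_parent_missing)
```

With both parents present the result is the parental average (about -0.10); with one parent missing it is the sibling average (about 0.10).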

To obtain a variance, an empirical variance under the null hypothesis is used, because the theoretical variance cannot be computed from Mendelian transmissions in this context.
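As a rough sketch of how an empirical variance can be used, the following Python computes a generic FBAT-style Z score from per-family contributions. The data layout and function name are assumptions made for illustration, and the exact form used inside PBAT may differ. Each family contributes the sum of trait times (observed minus expected intensity) over its offspring, and the variance is estimated by the sum of squared family contributions.

```python
import math


def cnv_fbat_z(families):
    """families: list of families; each family is a list of
    (trait, observed_intensity, expected_intensity) tuples, one per offspring."""
    contributions = []
    for offspring in families:
        # Family contribution: sum of T * (X - E[X]) over its offspring.
        u_i = sum(t * (x - e) for t, x, e in offspring)
        contributions.append(u_i)
    u = sum(contributions)
    # Empirical variance under the null: sum of squared family contributions.
    v = sum(u_i ** 2 for u_i in contributions)
    if v == 0:
        return None  # no informative families
    return u / math.sqrt(v)


example = [
    [(1.0, 0.4, 0.1)],
    [(1.0, 0.3, 0.2), (-1.0, 0.0, 0.2)],
    [(-1.0, -0.2, 0.1)],
]
z = cnv_fbat_z(example)
print(z)
```

In this toy example every family contributes 0.3, so the Z score is 0.9 divided by the square root of 0.27.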

All robustness properties of the genotype FBAT approach are maintained in PBAT CNV analysis. In addition, all previously-developed FBAT extensions, including FBATs for time-to-onset, multivariate FBATs, and FBAT testing strategies, can be directly transferred to the analysis of copy-number variation.

The following PBAT CNV features are available in SVS:

  • Computation of CNV FBAT statistics for nuclear families and for extended pedigrees.
  • Multivariate CNV FBATs for multiple phenotypes: FBAT-GEE and FBAT-PC. FBAT-GEE is based on the generalized estimating equation approach.
  • Transformation tools for continuous phenotypes that are not normally distributed.
  • Including predictor variables in the CNV PBAT.
  • Including gene-environment/drug interactions in the CNV FBAT statistic.

Using PBAT CNV Analysis

Getting Started

The first step is to open an existing project or create a new project where you want to perform the data analysis and save the results. See Getting Started for more information on creating a new project or opening an existing one.

Once you have opened or created a project, you must import your pedigree and/or phenotype data into SVS. See Importing PBAT Family-Based Data for information on how to import pedigree and phenotype files. A properly imported pedigree file will have the six required pedigree columns at the front of the spreadsheet and the column name headers will have a blue background. See Special Features of a Pedigree Spreadsheet for more information about pedigree spreadsheets.

The copy number intensity data also needs to be loaded into the SVS project. Log ratio data can be imported in many different ways; see the section of Importing Your Data Into A Project that applies to your data format.

NOTES:

  1. When creating your pedigree, remember to list the parents, even if their genotype information is not known. This ensures that siblings are grouped together properly into families.
  2. If unrelated families are listed together using the same family ID, the results will be unpredictable.
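Before importing, it can help to check that every parent referenced in the pedigree also has a row of its own. The following Python sketch does this for a pedigree held as a list of dictionaries. The key names ('fam', 'id', 'father', 'mother') and the use of "0" for an unknown parent are assumptions for illustration, not SVS requirements.

```python
def missing_parent_rows(pedigree):
    """Return (family, individual, parent) for each referenced parent
    that is not listed as an individual in the same family."""
    listed = {(row["fam"], row["id"]) for row in pedigree}
    problems = []
    for row in pedigree:
        for parent in (row["father"], row["mother"]):
            # Assumption: "0" marks a founder with no recorded parent.
            if parent != "0" and (row["fam"], parent) not in listed:
                problems.append((row["fam"], row["id"], parent))
    return problems


pedigree = [
    {"fam": "F1", "id": "dad", "father": "0", "mother": "0"},
    {"fam": "F1", "id": "kid1", "father": "dad", "mother": "mom"},
    {"fam": "F1", "id": "kid2", "father": "dad", "mother": "mom"},
]
problems = missing_parent_rows(pedigree)
print(problems)
```

Here the mother is referenced by both children but never listed, so both children are reported.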

If there is additional phenotype information to be used for the PBAT analysis, join the pedigree and phenotype spreadsheets together, keeping unmatched rows. See Figure 75. The resulting spreadsheet will keep the pedigree columns at the front of the spreadsheet, followed by the phenotype columns and then the CNV data. Note, the CNV columns must be marker mapped. See Figure 76.

NOTE:

  • You don’t have to have additional phenotype columns to perform a PBAT analysis, but if you do, you need to follow the above steps to join the phenotype dataset to the pedigree dataset.
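The effect of joining while keeping unmatched rows can be illustrated outside SVS with a small Python sketch. It assumes that each spreadsheet is represented as a dictionary from sample ID to a list of column values, that unmatched rows from both spreadsheets are kept, and that empty cells are filled with the missing value "?". The function name is hypothetical, and the join itself is performed by SVS, not by this code.

```python
def join_keep_unmatched(pedigree, phenotypes, missing="?"):
    """Join two {sample_id: [values]} tables, padding unmatched rows."""
    n_ped = len(next(iter(pedigree.values()), []))
    n_phe = len(next(iter(phenotypes.values()), []))
    # Pedigree samples first, then any phenotype-only samples.
    samples = list(pedigree) + [s for s in phenotypes if s not in pedigree]
    joined = {}
    for sample in samples:
        ped_cols = pedigree.get(sample, [missing] * n_ped)
        phe_cols = phenotypes.get(sample, [missing] * n_phe)
        joined[sample] = ped_cols + phe_cols  # pedigree columns stay in front
    return joined


pedigree = {"kid1": ["fam1", "dad1", "mom1"], "dad1": ["fam1", "0", "0"]}
phenotypes = {"kid1": [2.5]}
joined = join_keep_unmatched(pedigree, phenotypes)
print(joined)
```

The parent without phenotype data is retained, with "?" in the phenotype column, so the family structure is not lost.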

[Picture]

Figure 75: CNV Pedigree spreadsheet joined to a CNV Phenotype spreadsheet


[Picture]

Figure 76: CNV Pedigree and Phenotype Spreadsheet joined to CNV data

PBAT CNV Analysis can be performed by opening a marker mapped pedigree spreadsheet with CNV data, activating the markers to be analyzed, and by selecting Analysis > PBAT CNV Analysis. A parameter selection dialog will open.

NOTE:

  • If you have many markers in your pedigree spreadsheet, it may be easiest to use Select > Column > Inactivate All Columns, to inactivate all columns. Then activate any phenotype columns as well as the columns for those markers you wish to analyze before opening the PBAT CNV Analysis dialog.

The parameters for PBAT CNV Analysis include phenotype (and other variable) selections, the type of analysis, type of screening, and parameters for phenotypes, haplotypes, test statistic and computational algorithm. In the parameter selection dialog, the parameters are organized into four tabs: Select Phenotypes, Phenotype Parameters, Test Statistic and Computational, and Multiple Processes. Each tab is described below.

Select Phenotypes

[Picture]

Figure 77: PBAT CNV Analysis dialog – Select Phenotypes tab

The Select Phenotypes tab of the dialog allows you to select the phenotypes to test. Figure 77 illustrates what the tab of this dialog looks like if there are additional phenotype columns joined to the pedigree and CNV data columns.

Phenotypes

In this list, select the phenotype or phenotypes to be analyzed for association with the selected markers. Multi-select operations are valid in these dialog boxes. These are: <Ctrl>-left-click selects multiple phenotypes one at a time, and <Shift>-left-click selects all phenotypes between the first and last selected phenotypes.

Phenotypes as predictor variables (covariates)

It may be possible that the selected phenotypes are not only associated with certain markers, but also are predicted by other phenotype variables (covariates for the test statistic). Select these other variables in this box to better determine the actual genetic effect after adjusting for the selected predictor variables.

When important covariates for the selected phenotypes are known, adding them to the conditional mean model [Lange 2002b, Lange 2002c] and also using them for the offset computation can increase the power of the FBAT statistic substantially.

Double-click on an item in this list to select or deselect it. An option dialog will appear. To select the variable, select the top radio button and enter the maximum power/order of the predictor variable. This determines the covariates that are added to the conditional mean model and to the offset value. For instance, entering “3” will add X_j, X_j^2, and X_j^3, where X_j is the selected predictor variable, to the model. To remove all orders of this predictor variable from the model, select the bottom radio button.
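To make the effect of the maximum power/order concrete, this short Python sketch builds the columns that a maximum order of 3 would add for one predictor. The function name is hypothetical and the code only illustrates the arithmetic.

```python
def predictor_terms(values, max_order):
    """Return {order: [x ** order for each x]} for order = 1..max_order."""
    if max_order < 1:
        raise ValueError("max_order must be at least 1")
    return {order: [x ** order for x in values]
            for order in range(1, max_order + 1)}


terms = predictor_terms([1.0, 2.0, 3.0], 3)
print(terms)
```

For the values 1, 2 and 3, the third-order column holds 1, 8 and 27.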

Phenotypes as interaction variables

To account for interactions of one or more phenotypic variables with the marker being tested (“gene/covariate interactions”), select the interaction variables in this box.

Double-click on an item in this list to select or deselect it. An offset selection dialog will appear with three options; select the appropriate one:

  • Offset = mean: To use the mean of the selected variable as the offset, select this option.
  • Specify offset: Use this option to specify an offset for the selected variable, and enter the offset value in the Offset value box.
  • Deselect this interaction variable: To remove the selected variable as an interaction variable select this option.

NOTE:

  • It is recommended that you use a particular offset choice here only when its effects need to be examined. In a standard data analysis, it is preferable to use “mean” here and allow all offsets to be computed by using one of the estimating procedures specified in the Offset drop-down menu on the next tab.

Subgroups

PBAT analyses may be divided into subgroups of patients (a stratified analysis). The outputs for the separate analyses of the subgroups will be provided on the same output spreadsheet, separated and categorized by subgroup.

To divide your patients into subgroups, click the box labeled Use a variable to define subgroups, and select one of the phenotype variables listed (this will be the grouping variable). Only binary, integer, and categorical variables can be used as grouping variables.

Select subgroup categories

Once the subgroup option is selected, this box becomes available and all subgroups for the selected variable are listed. Select the category or categories from the grouping variable to calculate the PBAT statistics on. Multi-select operations are available in this list box.
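Conceptually, a stratified analysis partitions the samples by the value of the grouping variable and keeps only the selected categories. A minimal Python sketch follows, with hypothetical names and made-up sample data.

```python
from collections import defaultdict


def split_by_subgroup(samples, group_values, selected):
    """Group sample IDs by grouping-variable value, keeping selected categories."""
    groups = defaultdict(list)
    for sample, value in zip(samples, group_values):
        if value in selected:
            groups[value].append(sample)
    return dict(groups)


samples = ["s1", "s2", "s3", "s4", "s5"]
site = ["A", "B", "A", "C", "B"]
subgroups = split_by_subgroup(samples, site, selected={"A", "B"})
print(subgroups)
```

Each retained category would then be analyzed separately, and the sample in category "C" is left out because that category was not selected.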

Censoring Variables for Time-to-Onset Analysis

Time-to-onset analysis is not available for SVS CNV PBAT. Thus it will not be possible to select a censor variable.

Phenotype Parameters

[Picture]

Figure 78: PBAT CNV Analysis dialog – Phenotype Parameters tab

The next tab in the PBAT CNV Analysis dialog is the Phenotype Parameters tab, see Figure 78.

Maximum and Minimum Number of Phenotypes per Group

  • FBAT-GEE statistic:
    If more than one phenotype is selected, the test can be performed against all of the phenotypes as one group, just one phenotype at a time, or any number of phenotypes combined together. Testing against more than one phenotype at a time will result in a multivariate test. To select the number of phenotypes to “group together” when testing, set the minimum and maximum number in the Min number of phenotypes per group and Max number of phenotypes per group.
  • FBAT-PC statistic:
    The FBAT-PC statistic may be used to find the relative weights of many phenotypes within a PBAT principal component analysis. Set both Max number of phenotypes per group and Min number of phenotypes per group to the number of phenotypes selected. FBAT-PC tests against every phenotype individually as a part of its analysis. Select the non-compact output format (Test Statistic and Computational) to see the weight of each phenotype within the principal component.
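The minimum and maximum group sizes can be thought of as bounds on which combinations of the selected phenotypes are tested together. The following Python sketch enumerates those combinations. It assumes that every combination within the bounds is tested, which the description above suggests but does not state outright; the phenotype names are made up.

```python
from itertools import combinations


def phenotype_groups(phenotypes, min_size, max_size):
    """All phenotype combinations with min_size..max_size members."""
    groups = []
    for size in range(min_size, max_size + 1):
        groups.extend(combinations(phenotypes, size))
    return groups


selected = ["fev1", "bmi", "height"]
univariate_and_pairs = phenotype_groups(selected, 1, 2)
all_together = phenotype_groups(selected, 3, 3)
print(len(univariate_and_pairs), all_together)
```

With three phenotypes, bounds of 1 and 2 give three univariate tests plus three pairwise tests, while setting both bounds to 3 gives the single group recommended for FBAT-PC.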

Offset Choice

The phenotype offset may be specified in this menu and, when applicable, the following text box.

The final trait used in FBAT calculations is the original phenotype value minus the offset.

The offset accomplishes two purposes:

  1. Increases the power of the FBAT statistic by offsetting the mean of the original phenotype from the trait.
  2. Incorporates covariates and interaction variables into the FBAT statistic.

The offset choices in this menu are:

  • No offset: No offset is used; only the original phenotype value is used. Neither covariates nor interaction variables are incorporated into the FBAT statistic. (Useful for affected-only analyses.)
  • Optimal power: Use the offset that maximizes the power of the FBAT-statistic (computationally slow, efficiency dependent on the correct choice of the mode of inheritance).
  • Phenotypic residuals (including E(X|H0)): Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes the expected marker score (E(X|H0)) as well as all covariates and interaction variables. (This differs from standard phenotypic residuals only in the inclusion of the expected marker score.)
  • Standard phenotypic residuals: Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes all covariates and interaction variables.

    In other words, the offset is the phenotype predicted by a regression of the observed phenotype on all of the covariates and interaction variables, so the trait used in the test is the difference between the observed phenotype and the predicted phenotype. If there are no covariates or interaction variables selected, this amounts to subtracting the mean phenotype value (for a continuous phenotype), or the sample prevalence (for a dichotomous phenotype).

  • Specify here: (User-specified offset.) Enter the offset to use in the text box to the right of this menu. (Useful for unaffected-only studies, where an offset of 1 is used, or when the effects of a particular offset need to be examined.)
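The relationship between phenotype, offset and trait can be sketched in Python for the simplest cases. The function and option names are hypothetical. The "mean" choice stands in for Standard phenotypic residuals when no covariates or interaction variables are selected; the covariate-adjusted and optimal-power offsets are not reproduced here.

```python
from statistics import mean


def apply_offset(phenotypes, choice="mean", value=0.0):
    """Return trait = phenotype - offset for a simple offset choice."""
    if choice == "none":
        offset = 0.0
    elif choice == "mean":
        # Mean for a continuous phenotype, sample prevalence for a 0/1 phenotype.
        offset = mean(phenotypes)
    elif choice == "specified":
        offset = value
    else:
        raise ValueError(f"unknown offset choice: {choice}")
    return [y - offset for y in phenotypes]


affection = [1, 1, 0, 1]
print(apply_offset(affection, "mean"))
print(apply_offset(affection, "specified", 1.0))
```

For this dichotomous example the sample prevalence is 0.75, so affected offspring get a trait of 0.25 and the unaffected offspring gets -0.75. With a specified offset of 1, only the unaffected offspring has a nonzero trait.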

Normally, it is recommended to use Standard phenotypic residuals, except in the case of affected-only studies, where it is normally recommended to use No offset.

Other possibilities include:

  • Unaffected-only studies (use an offset of 1).
  • Other studies using binary traits (use the disease prevalence).
  • Total population samples and ascertained samples where the quantitative trait is not highly correlated with the ascertainment criteria (the offset should approximate the phenotypic mean – use Standard phenotypic residuals).
  • Ascertained samples where the quantitative trait is highly correlated with the ascertainment criteria (dichotomize and set the offset to 0, i.e., No offset).

Compute All Predictor Sub-Models

Check the Compute all predictor sub-models box to run separate tests for every possible combination of the selected covariates (predictors).

Uncheck this box to use all of the covariates combined together in one test.
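The number of sub-models grows quickly with the number of covariates, which the following Python sketch makes visible. Whether PBAT also runs the model with no covariates is not stated here, so the sketch leaves that as an explicit option; the covariate names are made up.

```python
from itertools import combinations


def predictor_submodels(covariates, include_empty=False):
    """Enumerate the covariate subsets that could each form a separate test."""
    start = 0 if include_empty else 1
    return [combo
            for size in range(start, len(covariates) + 1)
            for combo in combinations(covariates, size)]


submodels = predictor_submodels(["age", "sex", "bmi"])
print(len(submodels))
print(submodels[-1])
```

Three covariates give seven non-empty sub-models. The last one, containing all three covariates, corresponds to the single test that is run when the box is unchecked.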

Transformations

The phenotypes can be used as is without a transformation, or the selected phenotypes can be transformed to ranks or Z-scores (normal scores). There is a similar choice for the selected predictor variables and also for the selected interaction variables. In practice, it is recommended to transform the data to normal scores, since the asymptotic convergence of the FBAT-statistic is robust against outliers and skewed data [Lange 2002a].
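The two transformations can be sketched in Python as follows. The rank function assigns average ranks to ties. The normal-score function maps each rank r to the standard normal quantile of (r - 0.5)/n, which is one common convention and is an assumption here; the exact formula used by PBAT is not given in this section.

```python
from statistics import NormalDist


def to_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def to_normal_scores(values):
    """Rank-based normal scores using the assumed (r - 0.5) / n convention."""
    n = len(values)
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in to_ranks(values)]


ranks = to_ranks([10, 30, 20, 20])
scores = to_normal_scores([1.2, 50.0, 3.4])
print(ranks, scores)
```

Because the scores depend only on ranks, the outlying value 50.0 receives the same score it would get if it were only slightly larger than 3.4.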

Alternative Rapid Pedigree Algorithm

Check Use alternative rapid pedigree algorithm to use a new algorithm for processing extended pedigrees. This is currently the default pedigree algorithm. Uncheck this box to use the standard pedigree algorithm.

Please see 9.4 for a full explanation of each of the two pedigree algorithms and the advantages and disadvantages of each of them when analyzing genotypic data.

Computationally, because a simple averaging technique is used to infer the expected marker scores, PBAT CNV analysis of extended pedigrees under the standard pedigree algorithm does not suffer from long computation times in the same way that PBAT analysis of genotypic data can under the same circumstances. However, for the sake of completeness, both pedigree algorithms are offered for CNV analysis.

Test Statistic and Computational

[Picture]

Figure 79: PBAT CNV Analysis dialog – Test Statistic and Computational tab

The next tab in the PBAT CNV Analysis dialog is the Test Statistic and Computational tab, see Figure 79. On this tab there are options to specify the test statistic parameters, computational parameters and the screening type.

Test Statistic Parameters

  • Test Statistics: select one of the following test statistics as appropriate.
    • FBAT-GEE: generalized estimating equation for FBAT. If one phenotype is selected, the FBAT-GEE statistic simplifies to the standard univariate FBAT-statistic. If several phenotypes are selected, all phenotypes are tested simultaneously using FBAT-GEE.

      For FBAT-GEE:

      • Both binary and continuous phenotypes will work.
      • Can combine phenotypes with different distributions (e.g. continuous and ordinal).
      • For each phenotype, an additional degree of freedom is used.
      • This statistic is not as good for a large number of phenotypes.

      Generally, the FBAT-GEE statistic can handle a moderate amount of any type of multivariate data, including groups of dichotomous phenotypes.

    • FBAT-PC: principal components FBAT extension for longitudinal phenotypes, repeated measurements and correlated phenotypes.

      This method tests a weighted sum of all the measurements, with the weights determined so as to maximize the genetic component of the overall phenotypes and to minimize the phenotypic/environmental variance. Generalized principal component analysis is used to determine these weights.

      For FBAT-PC:

      • All phenotypes must have the same distribution.
      • Degrees of freedom always equals one regardless of how many phenotypes are used.
      • As the number of phenotypes increases the power increases.
      • Quantitative phenotypes are preferable.
      • Good for a large number of phenotypes.
      • Can be its own type of marker “screening” test, since small genetic effects are amplified.

      Generally, FBAT-PC is more powerful than FBAT-GEE if the phenotypes are correlated and quantitative.

  • Null Hypothesis: Specify the applicable null hypothesis from among the following options.
    • No linkage and no association: Standard hypothesis
    • Linkage and no association: Use if testing in a region with known linkage.
    • Linkage and no association (sw): Use if testing in a region with known linkage and there are large pedigrees. The empirical variance requires estimation of the correlation between all pedigree members, which can be unstable in large pedigrees. Here “sw” stands for “sandwich variance”, which is used to provide a more robust variance estimate.

GFBAT

To adjust the FBAT statistic for environmental correlation between the traits of multiple siblings in a family (GFBATs), select this option [Lange 2002b].

Computational Parameters

The following options set the remaining computational parameters.

  • Maximal iterations for GEE: Enter the maximal number of iteration steps in the GEE-estimation procedure. Enter “0” to use least-squares residuals. Otherwise, GEE residuals are computed (useful when multiple correlated phenotypes are analyzed). This choice will be active only if the FBAT-GEE statistic is selected.
  • Significance level: Enter the significance level to be used for the power calculations.

    Typically, 0.0005 might be used. However, for logrank tests, a higher significance level, such as 0.01, is preferable.

Output Format

The parameters in this box allow for indicating alternative and/or additional outputs to be included in the resulting spreadsheet.

  • Use compact output format: Select this option to output the shorter format that was developed for the database at the Channing Laboratories. This format is guaranteed to contain 17 columns plus a row label column for the marker names if Output -log 10 p-values is not selected. If -log 10 p-values are included in the output, an additional 3 columns are added.
  • Display p-values as signed numbers to show the direction of the main effect: Select this option to place a negative sign on the p-value when there is a negative correlation between the phenotype and the number of transmitted target/disease alleles. If this option is not selected, all p-values will be displayed as positive numbers.

    NOTE:

    • Signed p-values are not available when more than one phenotype is being tested at a time under FBAT-GEE, or when testing for interactions.
  • Output -log 10 p-values: Select this option to output -log10(p-value) for all p-values in the output, in addition to the p-values themselves.

Multiple Processes

[Picture]

Figure 80: PBAT CNV Analysis dialog – Multiple Processes tab

The next tab in the PBAT CNV Analysis dialog is the Multiple Processes tab, see Figure 80. On this tab there are options through which you can choose to run PBAT in multiple processes. This allows you to take advantage of multiple processors on a single machine by selecting Local Machine, or multiple machines in a distributed environment by selecting Run on Condor® Pool. If the option Divide Jobs Into Multiple Processes is not checked, PBAT will run normally on the current computer.

NOTE:

  • Dividing jobs into multiple processes is not allowed for haplotype analysis.

Local Machine

With the advent of dual-core and multiple processor systems as common desktop configurations, it is nice to take full advantage of the extra CPU resources available. It may also be convenient to divide analysis into multiple jobs for the purpose of keeping memory usage low when analyzing hundreds of thousands of SNPs.

When running multiple processes on a local machine, setting Maximum number of simultaneous jobs to be less than the total number of jobs will limit the number of jobs that can be run at one time. It is recommended to only run one concurrent job per processor. This will avoid memory access contention which severely impacts performance. So typically, this number should equal the number of processors and/or cores available on the current machine.
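The following Python sketch shows one way to reason about these two settings: splitting a marker list into a chosen number of jobs, and capping the number of simultaneous jobs at the number of processors. The function names are hypothetical, and SVS performs the actual division itself. Note that os.cpu_count() reports logical processors, which may exceed the number of physical cores.

```python
import os


def split_into_jobs(markers, n_jobs):
    """Split markers into n_jobs contiguous chunks of nearly equal size."""
    n_jobs = max(1, min(n_jobs, len(markers)))
    base, extra = divmod(len(markers), n_jobs)
    chunks, start = [], 0
    for j in range(n_jobs):
        size = base + (1 if j < extra else 0)  # first 'extra' chunks get one more
        chunks.append(markers[start:start + size])
        start += size
    return chunks


def max_simultaneous(n_jobs):
    """One concurrent job per processor, never more than the number of jobs."""
    return min(n_jobs, os.cpu_count() or 1)


jobs = split_into_jobs(list(range(10)), 4)
print(jobs, max_simultaneous(len(jobs)))
```

Ten markers split into four jobs give chunks of sizes 3, 3, 2 and 2.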

Run on Condor® Pool

Condor® is a freely available, specialized batch system for managing compute-intensive jobs in a distributed network environment. Condor® and its extensive user manuals can be found at http://www.cs.wisc.edu/condor/. As Condor® is cross-platform, you can easily set up a Condor® pool on Windows, Linux or Mac OS X based systems and take advantage of a distributed computing environment with PBAT CNV Analysis.

To run multiple jobs through Condor®, select the Run on Condor Pool option and browse to the location of the bin folder inside the directory where Condor® was installed on the system. Click Test to have SVS check that Condor® is configured and connected to a central manager.

It may be advantageous to specify the creation of more jobs than the number of machines available in the Condor® pool. Condor® will properly queue jobs and even out the effect of slower and faster computers taking longer or shorter times on each job.

For instructions on how to install Condor® on your network, see Appendix 17.

Output Spreadsheet

When all of the parameters are set, click Run to begin the analysis. A progress dialog will appear. The analysis may be stopped by pressing Cancel on the progress dialog.

If the PBAT analysis finishes normally, and results were obtained using the selected parameters, a results spreadsheet will be created and displayed. If no test has enough informative families for display, no output spreadsheet will be created.

Using Output for Screening

The main screening technique for choosing which FBAT tests to consider is based on the “Conditional Mean Model”.

In PBAT, the screening results are output into the same spreadsheet as the results from the actual FBAT tests. This allows sorting by the screening (power) results, and selecting only those results which have the most significant power. The FBAT tests which are contained in these same spreadsheet rows (indicating the tests with the most power) may be considered as if they had been calculated separately from the other FBAT tests, and the multiple-test correction applied only to these FBAT tests. This may be done because the screening tests are independent of the offspring genotype component of the FBAT tests themselves. Both the screening tests and the FBAT tests are conditioned on the same known quantities, namely the parental genotypes and the offspring phenotype(s).
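The screening workflow described above can be sketched in Python. The sketch assumes the output rows have been exported as dictionaries with hypothetical keys 'marker', 'screen' (the screening result used for sorting) and 'pvalue', and it uses a Bonferroni adjustment over the selected tests as one possible multiple-test correction. Neither the keys nor the choice of correction comes from SVS.

```python
def screen_and_test(results, top_k, alpha=0.05):
    """Keep the top_k rows by screening result, then correct only for those."""
    ranked = sorted(results, key=lambda r: r["screen"], reverse=True)
    selected = ranked[:top_k]
    # Assumption: Bonferroni correction over the selected tests only.
    threshold = alpha / len(selected) if selected else alpha
    return [r["marker"] for r in selected if r["pvalue"] <= threshold]


results = [
    {"marker": "A", "screen": 0.90, "pvalue": 0.001},
    {"marker": "B", "screen": 0.80, "pvalue": 0.030},
    {"marker": "C", "screen": 0.20, "pvalue": 0.0001},
    {"marker": "D", "screen": 0.10, "pvalue": 0.500},
]
significant = screen_and_test(results, top_k=2)
print(significant)
```

With two tests selected the threshold is 0.025, so only marker A is reported. Marker C has the smallest p-value but is not considered because it did not pass the screen.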

Compact Format

This shorter format was developed for the database at the Channing Laboratories. It is guaranteed to contain 17 columns plus a row label column for the marker names unless Output -log 10 p-values is selected. An additional 3 columns will be added if -log 10 p-values are included in the output.

The 17 columns are as follows:

  • Groupname: this is the grouping variable, if grouping is used. Otherwise, the column will be filled with the missing value “?”.
  • Group: this is the group variable value, if grouping is used. Otherwise, the column will be filled with the missing value “?”.
  • Allele: this column is not relevant for CNV analysis and will be filled with the missing value “?”.
  • Freq: this column is not relevant for CNV analysis and will be filled with the missing value “?”.
  • HWE: this column is not relevant for CNV analysis and will be filled with the missing value “?”.
  • phenos: phenotype(s) used.
  • cov: covariate(s) used, if any.
  • inter: interaction variable(s) used, if any.
  • model: this column is not relevant for CNV analysis and will be filled with 0’s.
  • test: statistical test used.
    • FBAT-GEE
    • FBAT-PC
  • #infofam: the number of families that were informative for this test.
  • pvalue: p-value for the FBAT statistic. This is for the main genetic effect, if this test had an interaction term.

    NOTES:

    1. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
    2. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.
  • power: this column is not relevant for CNV analysis and will be filled with the selected significance level.
  • wald: the result of the Wald test. The values here will only be meaningful if the conditional mean model would have been meaningful for this test.
  • herit: this column is not relevant for CNV analysis and will be filled with 0’s.
  • FBATI: joint p-value for the main effect and the interaction term. If no interaction term was selected, then a value of “1” will be returned.
  • powerFBATI: power for the FBAT interaction statistic, if an interaction term and screening with conditional power was selected.

If -log 10 p-values are included in the output, the following additional columns will be included:

  • -log10 pvalue: -log10(pvalue), inserted to the right of the pvalue column
  • -log10 wald: -log10(wald), inserted to the right of the wald column
  • -log10 FBATI: -log10(FBATI), inserted to the right of the FBATI column
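For reference, the transformation behind these columns is easy to reproduce. The Python sketch below takes the absolute value first, on the assumption that a negative sign on a signed p-value only encodes the direction of the effect. How SVS itself treats signed p-values in these columns is not stated here.

```python
import math


def neg_log10(p_value):
    """-log10 of a p-value, ignoring any sign used to show effect direction."""
    p = abs(p_value)
    if p == 0:
        return math.inf  # the transform is unbounded as p approaches 0
    return -math.log10(p)


print(neg_log10(0.001), neg_log10(-0.01), neg_log10(1.0))
```

A p-value of 0.001 maps to about 3, and an FBATI value of 1 (no interaction term selected) maps to 0.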

Normal Expanded Format

The normal expanded format output will have a varying number of columns, depending on the parameters selected and how many phenotypes are in the phenotype spreadsheet. Since a column will be present for every possible phenotype, the spreadsheet may be quite wide. However, all output statistics are visible in this format.

The output spreadsheet columns in the expanded format may be divided into several categories:

  • Row label with marker information
  • Subgroup designation
  • P-values
  • Phenotype columns
  • Extra columns for powers of predictor phenotypes, if necessary
  • Extra columns relating to FBAT-PC, if necessary
  • Extra columns relating to interactions, if necessary
  • -log10 columns for p-values (if this output option is selected)

The column groups are:

  • Marker information:
    The marker name is set as the row label.
  • Subgroup designation:
    If you have defined sub-groups of the population, the subgroup to which the analysis was restricted is shown in the first column. The missing value “?” in the first column means that all of the samples were analyzed.
  • P-values:
    The statistical outputs are listed in the following columns:
    • pvalue(FBAT): P-value for the FBAT statistic. This is for the main genetic effect, if this test also included an interaction term.

      NOTES:

      1. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
      2. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.
    • pvalue(FBATI): Joint p-value for the main effect and the interaction term. If no interaction term was selected, this column will be filled with ones.
    • pvalue(Wald): P-value of the overall Wald test for a genetic effect in the conditional mean model. These values will be meaningful only if the conditional mean model would have been appropriate for this test.
    • pvalue(WaldI): P-value of the overall Wald test for a gene/covariate interaction in the conditional mean model. These values will be meaningful only if the conditional mean model would have been appropriate for this test.
  • Phenotype columns:
    A column for every phenotype (including Affection Status) that is used in the model is shown. The following notation is used:
    • Not used in the analysis for this row.
    • Selected as a phenotype/trait and tested for association with FBATs in this row’s results.
    • Selected and used as a covariate/predictor variable. The 1’s indicate that the covariate/predictor variable is significant at both the 5% and the 1% significance levels in the conditional mean model.
    • Selected and used as a covariate/predictor variable. The 1 indicates that the covariate/predictor variable is significant at the 5% level, and the 0 indicates that it is not significant at the 1% level in the conditional mean model.
    • Selected and used as a covariate/predictor variable. The 0’s indicate that the covariate/predictor variable is not significant at either the 5% or the 1% significance levels in the conditional mean model.
    • Selected and used as an interaction variable in this row.
  • Extra columns for powers of predictor phenotypes, if necessary:
    If you used predictor variables with a maximum power greater than one, extra columns are included for the higher power phenotypes. The values for this column will be the same for predictor variables as above.
  • Extra columns relating to FBAT-PC, if necessary:
    If FBAT-PC has been selected as the test statistic, one additional column will be included in the output spreadsheet for every phenotype, indicating that phenotype’s weight in the FBAT-PC calculation.
  • Extra columns relating to interactions, if necessary:
    If one or more interaction variables are selected, additional columns will be included in the output spreadsheet. These columns are (in order):
    • main effect: An estimate of the regression coefficient for the main effect.
    • Std error: Standard error for the main effect coefficient.
    • p-value: P-value for the main effect coefficient.
    • interaction: An estimate of the regression coefficient for the interaction term.
    • Std error: Standard error for the interaction coefficient.
    • p-value: P-value for the interaction term coefficient.
    • FBAT-I: The FBAT statistic p-value for the interaction term coefficient (analogous to the above p-value for the interaction term and should have a similar value).
    • h-main: The heritability of the main effect.
    • h-interaction: The heritability of the interaction.
  • -log10 columns for p-values:
    Additional columns containing -log10(p-value) will be added if this output option is selected. The additional columns will be:
    • -log10 pvalue(FBAT): -log10(pvalue(FBAT)), inserted to the right of the pvalue(FBAT) column
    • -log10 pvalue(FBATI): -log10(pvalue(FBATI)), inserted to the right of the pvalue(FBATI) column
    • -log10 pvalue(Wald): -log10(pvalue(Wald)), inserted to the right of the pvalue(Wald) column
    • -log10 pvalue(WaldI): -log10(pvalue(WaldI)), inserted to the right of the pvalue(WaldI) column