PBAT Data Analysis

23.4.1 Summary

The tools which have been implemented in PBAT for the analysis of quantitative and dichotomous traits are discussed in a series of papers by Lange and Laird ([Lange 2002a], [Lange 2002b] and [Lange 2002c]). They allow a variety of analysis possibilities:

  • Computation of a large variety of FBAT-statistics and their power for nuclear families and for extended pedigrees.
  • Multivariate FBATs for multiple phenotypes: FBAT-GEE and FBAT-PC. FBAT-GEE is based on the generalized estimating equation approach. FBAT-PC is based on principal components that maximize the heritability (see 23.6).
  • FBATs for time to onset data/survival data (logrank-FBAT and Wilcoxon-FBAT, FBAT-EXP).
  • Permutation tests for certain FBAT statistics.
  • Transformation tools for continuous phenotypes that are not normally distributed.
  • Conditional power calculations for all implemented FBATs.
  • Construction of the most powerful FBAT-statistic.
  • Including predictor variables in the FBAT.
  • Including gene-environment/drug interactions in the FBAT statistic.
  • Various estimation routines to estimate the genetic effect size.
  • Screening methods to select the most “promising” combinations of markers and phenotypes without biasing the significance level of the FBAT statistic computed subsequently.

23.4.2 Using PBAT Family-Based Analysis

23.4.2.1 Getting Started

The first step is to open an existing project or create a new project where you want to do the data analysis. See 3.1.2 for details about creating a new project. See 3.4.1 for details about opening an existing project.

Once you have opened or created a project, you must import your pedigree data into HelixTree. See 4.5 for details about importing pedigree and phenotype files. Import the pedigree file using the import data option from the File drop-down menu: PBAT -> Import FBAT Pedigree or PBAT -> Import Text Pedigree. Import the phenotype file the same way, using PBAT -> Import FBAT Phenotype or PBAT -> Import Text Phenotype.

NOTE: When creating your pedigree, remember to list the parents, even if their genotype information is not known, in order to group siblings together properly into families.

NOTE: If unrelated families are listed together using the same family ID, the results will be unpredictable.

Now that we have imported our data set(s) we can begin the analysis process.

You can perform PBAT family-based analysis by opening an FBAT pedigree spreadsheet, activating the markers to be analyzed, and selecting the Genetics -> PBAT Family-Based Analysis menu option. A parameter selection dialogue will open.

NOTE: If you have many markers in your pedigree spreadsheet, it may be easiest to use Edit -> Inactivate all columns, then activate the columns for those markers you wish to analyze (before selecting Genetics -> PBAT Family-Based Analysis).

The parameters for PBAT family-based analysis include phenotype (and other variable) selections, the type of analysis, type of screening, phenotype parameters, haplotype parameters, test statistic parameters and computational parameters. In the parameter selection dialog, the parameters are organized into four tabs, which are:

  • Select Phenotypes
  • Phenotype and Haplotype Parameters
  • Test Statistic and Computational
  • Multiple Processes

23.4.3 Select Phenotypes


[Picture]
Figure 23.14: The Select Phenotypes tab of the PBAT Analysis Options dialog.

This tab of the dialog allows you to select the phenotypes to test.

If you are using your pedigree spreadsheet’s affection status as your only phenotype, proceed to selecting it within the appropriate list or lists, as documented below.

Otherwise, to obtain the other phenotypes, select the applicable phenotype spreadsheet. Press the OR Select Phenotype Spreadsheet button–this will bring up a dialog containing all the navigator nodes associated with the current project. Phenotype spreadsheets will be highlighted in white. To select a phenotype spreadsheet, highlight the spreadsheet you wish to use and click the OK button.

NOTE: You may also use the same button, which will now be labeled Select Other Phenotype Spreadsheet, to change phenotype spreadsheets or to go back to using affection status only. (To go back to using affection status only, click OK without selecting any spreadsheets.)

NOTE: If a phenotype has completely missing data, it will not appear in any of the phenotype lists. (This may happen for the affection status, for instance, when all of your actual phenotype data is coming from a separate spreadsheet.)

23.4.3.1 Phenotypes

In this list, select the phenotype or phenotypes to be analyzed for association with your selected markers or with haplotypes from your selected markers. (Control-click selects or deselects additional phenotypes, shift-click selects a range of phenotypes.)

23.4.3.2 Predictor Variables/Covariates

It may be possible that your phenotypes are not only associated with certain markers or haplotypes, but also are predicted by other phenotypic variables (covariates for the test statistic). Select these other variables here to better determine the actual genetic effect you are searching for.

When important covariates for the selected phenotypes are known, adding them to the conditional mean model ([Lange 2002b], [Lange 2002c]) and also using them for the offset computation can increase the power of the FBAT statistic substantially.

Double-click or right-click on an item in this list to select or deselect a predictor variable/covariate. A dialogue will appear. To select the variable, check the top radio button and enter the maximum power with which this variable may act as a predictor (the order up to which it will be added to the conditional mean model and to the offset value). For instance, entering “3” will cause this variable, the square of this variable, and the cube of this variable to all be used as covariates. To deselect, check the bottom radio button.
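
As an illustration of what the maximum power setting means, the following minimal Python sketch expands one covariate into its powers. The function name `expand_covariate` and the sample values are hypothetical and are not part of HelixTree or PBAT; the sketch only shows the arithmetic described above.

```python
def expand_covariate(values, max_power):
    """Return one column per power 1..max_power of a covariate."""
    if max_power < 1:
        raise ValueError("max_power must be at least 1")
    return {power: [v ** power for v in values]
            for power in range(1, max_power + 1)}


# Hypothetical covariate values for two individuals.
age = [2.0, 3.0]
columns = expand_covariate(age, 3)

print(columns[1])  # the variable itself
print(columns[2])  # its square
print(columns[3])  # its cube
```

With a maximum power of 3, three covariate columns result from the single selected variable.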

23.4.3.3 Phenotypes as Interaction Variables

To account for interactions of one or more phenotypic variables with the marker or haplotype being tested (“gene/covariate interactions”), select it (them) here.

Double-click or right-click on an item in this list to select or deselect it. A dialogue will appear. To select this phenotype as an interaction variable and use its mean as its offset, check the top radio button. To select this phenotype as an interaction variable and specify an offset, check the middle radio button and enter the required offset. To deselect, check the bottom radio button.

(NOTE: It is recommended that you use a particular offset choice here only when its effects need to be examined. In a standard data analysis, it is preferable to use “mean” here and allow all offsets to be computed by using one of the estimating procedures specified in the “Offset” drop-down menu on the next tab.)

23.4.3.4 Subgroups

PBAT analyses may be divided into subgroups of your patients (a stratified analysis). The outputs for the separate analyses of the subgroups will be provided on the same output spreadsheet, separated and categorized by subgroup.

To divide your patients into subgroups, click the box labeled Use a Variable to Define Subgroups, and select (click on) one of the phenotype variables listed (the grouping variable). NOTE: HelixTree only allows you to use a binary or an integer variable as a grouping variable.

When you have done this, a list of the possible categories (variable values) shows in the next panel. Select the category or categories by which to group the analyses. (Control-click selects or deselects additional categories, shift-click selects a range of categories.)

23.4.3.5 Censoring Variable for Time-to-Onset Analysis

To do time-to-onset analysis:

  • Select your time-to-onset variable in the upper left panel. This variable must be an integer.
  • Use the lower right panel to select a censoring variable. A censoring variable denotes whether the disease or condition has occurred at all during the study. It should be set to “1” (“not censored”) if the condition occurred (affected), and “0” (“censored”) if it did not (unaffected).
  • Select your other (phenotype, haplotype, and computational) parameters as necessary. FBAT-LOGRANK will have been automatically selected for you as your test statistic when you selected a censoring variable.

To select a censoring variable, click Use Censoring Variable (Time to Onset) and click on the phenotype to be used as a censoring variable.
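
The coding of the censoring variable can be sketched as follows. This is a minimal Python illustration with hypothetical records; it assumes, as is conventional for survival data, that a censored individual's time value is the age at last follow-up. That convention is an assumption of the sketch and is not stated in this manual.

```python
def censoring_indicator(onset_age):
    """Return 1 ("not censored") if onset was observed, 0 ("censored") otherwise."""
    return 0 if onset_age is None else 1


# Hypothetical records: (age at onset, or None if never observed; age at last follow-up)
records = [(34, 50), (None, 61), (47, 47)]

# Integer time-to-onset column: onset age if observed, else last follow-up age (assumption).
time_to_onset = [onset if onset is not None else last_seen
                 for onset, last_seen in records]
censor = [censoring_indicator(onset) for onset, _ in records]

print(time_to_onset)
print(censor)
```

The second individual never developed the condition during the study, so the censoring variable is 0 for that row.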

23.4.4 Phenotype and Haplotype Parameters


[Picture]
Figure 23.15: The Phenotype and Haplotype tab of the PBAT Analysis Options dialog.

The next tab in the PBAT Analysis Options dialog is the Phenotype and Haplotype Parameters tab.

23.4.4.1 Maximum and Minimum Number of Phenotypes per Group
  • If you are using the FBAT-GEE statistic:

    If you have selected more than one phenotype, you can test against all of the phenotypes as one group, against just one phenotype at a time, or against any combination in between. (Testing against more than one phenotype at a time will result in a multivariate test.) To select how many phenotypes to “group together” when testing, use Max Number of Phenotypes per Group and Min Number of Phenotypes per Group.

  • If you are using the FBAT-PC statistic:

    The FBAT-PC statistic may be used to find the relative weights of each of many phenotypes within a PBAT principal component analysis. Set both Max Number of Phenotypes per Group and Min Number of Phenotypes per Group to the number of phenotypes you have selected. FBAT-PC tests against every phenotype individually as a part of its analysis. Select the non-compact output format (23.4.5.12) to see the weight of each phenotype within the principal component.
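
To see which phenotype groups a given pair of Min and Max settings implies, the following Python sketch enumerates them. It assumes that every combination of each allowed size is tested; the function name `phenotype_groups` and the phenotype names are hypothetical.

```python
from itertools import combinations


def phenotype_groups(phenotypes, min_size, max_size):
    """List every group of phenotypes whose size lies in [min_size, max_size]."""
    groups = []
    for size in range(min_size, max_size + 1):
        groups.extend(combinations(phenotypes, size))
    return groups


selected = ["fev1", "fvc", "bmi"]  # hypothetical phenotype names

# Min = 1, Max = 2: three univariate tests plus three bivariate tests.
groups = phenotype_groups(selected, 1, 2)
print(len(groups))

# Min = Max = number of selected phenotypes: a single group, as used for FBAT-PC.
print(phenotype_groups(selected, 3, 3))
```

The number of groups grows quickly with the number of selected phenotypes, so a narrow Min/Max range keeps the output manageable.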

23.4.4.2 Offset Choice

The phenotype offset may be specified in this drop-down and the following text-edit box.

The final trait used in the FBAT calculations is the original phenotype value minus whatever offset is used.

The offset accomplishes two purposes:

  1. Increases the power of the FBAT statistic by offsetting the mean of the original phenotype from the trait.
  2. Incorporates covariates and interaction variables into the FBAT statistic.

The offset choices in this drop-down are:

  • No offset: No offset is used–only the original phenotype value is used. Neither covariates nor interaction variables are incorporated into the FBAT statistic. (Useful for affected-only analyses.)
  • Optimal power: Use the offset that maximizes the power of the FBAT-statistic (computationally slow, efficiency dependent on the correct choice of the mode of inheritance).
  • Phenotypic residuals (including E(X|H0)): Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes the expected marker score (E(X|H0)) as well as all covariates and interaction variables. (This differs from Standard phenotypic residuals (see below) only in the inclusion of the expected marker score.)
  • Standard phenotypic residuals: Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes all covariates and interaction variables.

    In other words, the trait used in the test will be the difference between the observed phenotype and a predicted phenotype, with the predicted phenotype serving as the offset. This predicted phenotype comes from a regression model that regresses the observed phenotype on all of the covariates in the dataset. If there are no covariates or interaction variables selected, this amounts to subtracting the mean phenotype value (for a continuous phenotype), or the sample prevalence (for a dichotomous phenotype).

  • Specify here: (User-specified offset) Allows you to specify the offset. Enter the offset you wish to use in the text-edit box to the right of this drop-down. (Useful for unaffected-only studies (use an offset of 1) or when you wish to examine the effects of a particular offset.)
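
The relationship "trait = phenotype minus offset" can be sketched in Python as follows. This is a simplified illustration, not PBAT's implementation: it uses ordinary least squares with a single covariate in place of the GEE estimation named above, and the function names and data are hypothetical.

```python
from statistics import mean


def mean_offset_trait(phenotype):
    """With no covariates, the offset is the sample mean (or prevalence)."""
    offset = mean(phenotype)
    return [y - offset for y in phenotype]


def residual_trait(phenotype, covariate):
    """Fit y = a + b*x by least squares; the fitted value is the offset."""
    x_bar, y_bar = mean(covariate), mean(phenotype)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, phenotype))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(covariate, phenotype)]


# Dichotomous phenotype: the sample prevalence (0.5 here) is subtracted.
affected = [1, 0, 0, 1]
print(mean_offset_trait(affected))

# Continuous phenotype fully explained by the covariate: residuals are zero.
print(residual_trait([1.0, 3.0, 5.0], [0.0, 1.0, 2.0]))
```

A user-specified offset works the same way, with the entered constant taking the place of the computed mean.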

Normally, it is recommended to use Standard phenotypic residuals, except in the case of affected-only studies, where it is normally recommended to use No offset.

Other possibilities include:

  • Unaffected-only studies (use an offset of 1)
  • Other studies using binary traits (use the disease prevalence)
  • Total population samples and ascertained samples where the quantitative trait is not highly correlated with the ascertainment criteria (the offset should approximate the phenotypic mean–use Standard phenotypic residuals)
  • Ascertained samples where the quantitative trait is highly correlated with the ascertainment criteria (Dichotomize and set the offset to zero–No offset).

23.4.4.3 Compute All Predictor Sub-Models

Check this box to test every possible combination of the selected covariates (predictors), each covariate being either included or left out, with each combination run as a separate test.

Uncheck this box to use all of the covariates combined together in one test.

23.4.4.4 Transformations

We can leave the phenotypes untransformed, or we can transform the selected phenotypes to ranks or to Z-scores (normal scores). We have a similar choice for the selected predictor variables and also for the selected interaction variables. Use the appropriate drop-down for each transformation choice. In practice, it is recommended to transform the data to normal scores, since the asymptotic convergence of the FBAT-statistic is robust against outliers and skewed data [Lange 2002a].
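
The two transformations can be sketched in Python as follows. The exact formulas PBAT uses are not given in this manual; the sketch assumes average ranks for ties and the common rank/(n + 1) rule for normal scores, and both function names are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def normal_scores(values):
    """Map each rank r to the standard normal quantile of r / (n + 1) (assumed rule)."""
    n = len(values)
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


print(average_ranks([10, 30, 20, 30]))
print(normal_scores([1, 100, 2]))
```

In the second call the extreme value 100 receives the same score it would receive if it were 3, which is how the transformation limits the influence of outliers.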

23.4.4.5 Multi-Marker and Multi-Phenotype Testing (MFBAT)

For the most common SNP and haplotype tests, multiple markers and/or multiple phenotypes may be subjected to both FBAT-GEE multivariate tests and FBAT-PC tests after the original analysis has finished. These tests are collectively referred to as “MFBAT” tests.

MFBAT testing may be done in conjunction with two common use cases:

  • SNP Testing. No interactions may be specified, no grouping of phenotypes is allowed, and Compute All Predictor Sub-Models must be unchecked. All combinations of M phenotypes and all combinations of N markers, where M and N take on all values within the bounds you have specified for numbers of phenotypes and numbers of markers, respectively, will be tested. The MFBAT output will follow after the output for all of the individual markers.
  • Haplotype testing using sub-haplotypes. (This includes the “Rapid Additive Model” test–see below.) In this mode of MFBAT testing, only the number of multiple phenotypes may be specified. The “one marker” that will be used in this case will actually be the haplotype just tested. All combinations of M phenotypes, where M will take on all values within the bounds you have specified for numbers of phenotypes, will be tested. The MFBAT output for any individual haplotype will follow after the output for that haplotype’s test.

Check Perform MFBAT Tests to perform these tests. Fill in the maximum and minimum numbers of phenotypes and of markers (if applicable) to be tested at a time.
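
Because all combinations of phenotypes and of markers within the specified bounds are tested, the amount of MFBAT output can grow quickly. The following Python sketch estimates the number of phenotype-set and marker-set pairings for SNP testing; the function name is hypothetical, and the count is per MFBAT statistic.

```python
from math import comb


def count_mfbat_combinations(n_phenotypes, min_pheno, max_pheno,
                             n_markers, min_markers, max_markers):
    """Count pairings of a phenotype set (size M) with a marker set (size N)."""
    pheno_sets = sum(comb(n_phenotypes, m)
                     for m in range(min_pheno, max_pheno + 1))
    marker_sets = sum(comb(n_markers, n)
                      for n in range(min_markers, max_markers + 1))
    return pheno_sets * marker_sets


# 3 phenotypes taken 1 or 2 at a time, 4 markers taken exactly 2 at a time.
print(count_mfbat_combinations(3, 1, 2, 4, 2, 2))
```

Tight bounds on the numbers of phenotypes and markers keep this count, and therefore the run time and output size, under control.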

The outputs will be identified by marker names separated by plus signs or a single marker name with a plus sign after it. The phrase “FBAT-GEE^2” or “FBAT-PC^2” will be used in place of an allele number or a haplotype designation to identify the test.

A p-value will appear in the normal p-value column (either “pvalue(FBAT)” or “FBAT-Wilcoxon”) for the “FBAT-GEE^2” test, and two power-related values will appear for the “FBAT-PC^2” test, in the normal p-value column and the next column to its right.

NOTE: MFBAT testing is valid for either the FBAT-GEE or FBAT-LOGRANK test statistic (selectable on the next tab).

NOTE: For FBAT-LOGRANK (time-to-onset) testing using haplotypes, an FBAT-GEE test is made for the censor variable under the haplotype being tested. This test is output after the original test and before the MFBAT test. The output fields are the same as for an FBAT-GEE test.

Check Use Simplified Variance Structure to simplify (average out rows in) the variance/covariance matrix used in the FBAT-PC calculations, thereby improving performance for the larger matrices.

23.4.4.6 Alternative Rapid Pedigree Algorithm

Check Use Alternative Rapid Pedigree Algorithm (NOTE: This algorithm is still experimental.) to use a new algorithm for processing extended pedigrees.

This new algorithm combines the advantages of the following two strategies:

  • Breaking up the extended pedigrees into nuclear families, which is a computationally fast strategy, but does not take full advantage of the structure of the known extended pedigree.
  • Analyzing extended pedigrees as such, which takes full advantage of all the information and is the most powerful option, but can be computationally slow when many of the genotypes in a pedigree are missing.

The extended pedigree algorithm is particularly slow when nuclear families (i.e. families in which all genotypes are known) in an extended pedigree are linked only by two or more family members for whom genotypic information is not available. Another such situation is an extended pedigree with “isolated genotypes”, that is, sparse genotypic information spread across the entire pedigree. In either situation, the power gain is minimal and sometimes even jeopardized by the possibility that the linking family member or members have to be removed when the maximum number of founders is reached in PBAT.

The new algorithm in PBAT identifies clusters of nuclear families in extended pedigrees that are directly linked (i.e. that share a family member) and analyzes such clusters as extended pedigrees. At the same time, clusters that are linked only through two or more family members without genetic information are broken up into separate extended-pedigree clusters. These clusters are analyzed in the same way that extended pedigrees would be under the original algorithm, but independently of each other.

The extra information provided to the computation of the genetic distribution under the original algorithm by linking together the extended-pedigree clusters is minimal, while the effort required for taking advantage of this information is disproportionately enormous. This puts the original algorithm at a severe disadvantage.

Under the new hybrid approach, however, such links between family clusters within extended pedigrees are dropped. The increased statistical power of the original extended pedigree algorithm is therefore largely maintained, while the computational speed is almost that of a pure nuclear-family analysis.
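
The clustering idea can be sketched in Python as a connected-components search. This is a simplified reading of the description above, not PBAT's actual algorithm: it assumes two nuclear families count as directly linked only when they share a member who has genotype data, and all names and data are hypothetical.

```python
def cluster_nuclear_families(families, genotyped):
    """Group nuclear families that share at least one genotyped member.

    families:  dict mapping a family label to the set of its member IDs.
    genotyped: set of member IDs that have genotype data.
    """
    labels = list(families)
    parent = {label: label for label in labels}

    def find(label):
        # Union-find lookup with path halving.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            # A link counts only if the shared member is genotyped (assumption).
            if families[a] & families[b] & genotyped:
                parent[find(a)] = find(b)

    clusters = {}
    for label in labels:
        clusters.setdefault(find(label), []).append(label)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical pedigree: member 4 links A and B; member 6 links B and C.
families = {"A": {1, 2, 3, 4}, "B": {4, 5, 6}, "C": {6, 7, 8}}
genotyped = {1, 2, 3, 4, 5, 7, 8}  # member 6 has no genotype data

print(cluster_nuclear_families(families, genotyped))
```

Here families A and B form one cluster, while C is analyzed on its own because its only connection to B runs through an ungenotyped member.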

23.4.4.7 Rapid Additive Model Analysis

To perform a rapid scan of markers using only one test per marker, check the Perform Rapid Additive Model Analysis box. The major allele for the marker being tested will be used as the “haplotype” for a haplotype test using the additive model. This is repeated over all selected markers.

Since you obtain the same p-value and power for both the major and minor alleles in a single-marker test using the additive model, testing only one allele this way does not lose any information.
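
The symmetry between the two alleles of a bi-allelic marker can be checked with a small Python sketch. It uses the general FBAT score form U = sum of T_i * (X_i - E[X_i]), where T_i is the offset-adjusted trait and X_i is the additive allele count; the data and the function name are hypothetical.

```python
def fbat_score(traits, allele_counts, expected_counts):
    """U = sum over offspring of T_i * (X_i - E[X_i])."""
    return sum(t * (x - e)
               for t, x, e in zip(traits, allele_counts, expected_counts))


# Hypothetical offspring data, coded by the number of major alleles.
traits = [0.5, -0.5, 0.5, 0.5]
major = [2, 1, 2, 0]
expected_major = [1.5, 1.0, 1.0, 0.5]

# Recoding by the minor allele: each count becomes 2 minus the major count.
minor = [2 - x for x in major]
expected_minor = [2 - e for e in expected_major]

u_major = fbat_score(traits, major, expected_major)
u_minor = fbat_score(traits, minor, expected_minor)

print(u_major, u_minor)
```

The two scores differ only in sign. Since the variance of 2 - X equals the variance of X, the squared, variance-standardized statistic is the same for both alleles.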

NOTE: When this box is checked, the following parameters in the interface (see below for their explanations) are pre-set and grayed out:

  • Haplotype analysis
  • Sub-haplotypes of length 1
  • Additive genetic model

(While you could set the above parameters individually, this box gives you a “one-check” selection of this mode.)

NOTE: The only way to obtain distributed processing for haplotype analysis is to check this box. Distributed processing is not normally allowed for haplotype analysis.

23.4.4.8 Permutation Testing

Permutation testing may be selected for either Rapid Additive Model analysis or for other modes of haplotype analysis. Check the Use Permutation Testing to Obtain P-Values box, and enter the Number of Permutations in the blank.

23.4.4.9 Haplotype Analysis

To perform haplotype analysis, check the Perform Analysis for Haplotypes group box. The haplotype-related choices delineated in the following subsubsections will then become active.

NOTE: If any enabled marker is multi-allelic, haplotype testing will select only those two alleles that are most prevalent, and treat the marker as if it is bi-allelic with these two alleles.

Check Overall Haplotype Test to additionally perform an overall haplotype test. This is a multivariate test performed on all the haplotypes whose frequency is greater than the cutoff frequency (see below).

NOTE: Checking this option is only valid when the Analyze All Sub-Haplotypes option is not checked.

NOTE: Checking this option is also only valid if only one level of grouping (using the Subgroups box in the first tab–see 23.4.3.4) is used, or if no explicit subgrouping is used at all.

When the Overall Haplotype Test box is checked, the Cutoff Frequency For Overall Haplotype Test field becomes active. Use this field to enter the minimum frequency a haplotype must have for inclusion in the overall test.

Check Analyze All Sub-Haplotypes to analyze haplotypes that are defined by subsets of the currently selected markers. Checking this box will also activate the Length of Sub-Haplotypes (0 for any length) field. If zero is entered, haplotypes from every subset (proper or not) of the SNPs will be analyzed. Entering zero is not allowed when more than 8 markers have been selected. If a non-zero number is entered in this field, only sub-haplotypes of length equal to that number of SNPs will be analyzed. The sub-haplotype length is not allowed to exceed 8.

In addition, if a number greater than one and less than the total number of markers selected is entered for Length of Sub-Haplotypes (0 for any length), the Only Sub-Haplotypes Defined by Adjacent SNPs field is activated. Checking this will effectively cause the sub-haplotypes to be analyzed in a moving window. Unchecking this, which is not allowed for more than 20 total SNPs, will go through all combinations of the selected SNPs taken sub-haplotype-length at a time, and can be very slow because of the large quantity of calculation and output requested.
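
The difference between the two modes can be sketched in Python as follows. The function name and SNP names are hypothetical; the sketch only enumerates which marker sets would define sub-haplotypes.

```python
from itertools import combinations


def sub_haplotype_marker_sets(markers, length, adjacent_only):
    """List the marker sets that define sub-haplotypes of a given length."""
    if adjacent_only:
        # Moving window over neighboring SNPs.
        return [tuple(markers[i:i + length])
                for i in range(len(markers) - length + 1)]
    # Every combination of the selected SNPs taken `length` at a time.
    return list(combinations(markers, length))


snps = ["rs1", "rs2", "rs3", "rs4"]  # hypothetical marker names, in map order

print(sub_haplotype_marker_sets(snps, 2, adjacent_only=True))
print(len(sub_haplotype_marker_sets(snps, 2, adjacent_only=False)))
```

The moving window grows linearly with the number of SNPs, while the number of unrestricted combinations grows combinatorially, which is why the unrestricted mode can be very slow.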

(If the Analyze All Sub-Haplotypes box is not checked, only the haplotypes defined by all the SNPs activated in the pedigree spreadsheet are analyzed, while no haplotypes defined by any (proper) subset of these SNPs are analyzed. Only 8 SNPs may be activated for analysis in this mode.)

Check Infer Missing Genotypes in Haplotypes to include individuals with missing genotype information in the analysis. The algorithm of [Horvath et al (2004)] is applied to all individuals, even if they have missing genotype information. Unfortunately, this can result in a greater number of ambiguous haplotypes.

NOTE: Also unfortunately, selecting this option (inferring missing genotypes in haplotypes) can be much more compute-intensive.

If this box is not checked, individuals with missing genotype information will be excluded from the analysis.

Check Remove Ambiguous Haplotypes From the Analysis to exclude ambiguous haplotypes from the analysis.

Normally, ambiguous haplotypes (possible haplotypes which cannot be inferred from the parental genotypes) are included in the analysis and are weighted according to their estimated frequencies in the probands.

Enter the Maximal Number of Mating Types for Computation for use in the haplotype analysis.

(One mating type is one combination of what the father’s haplotype pair (diplotype) and the mother’s haplotype pair (diplotype) might be.)

Using 100 is sufficient for most haplotype calculations. Use fewer to speed up the calculations. Use more to be more certain to get all mating types.

23.4.5 Test Statistic and Computational


[Picture]
Figure 23.16: The Test Statistic and Computational tab of the PBAT Analysis Options dialog.

On this tab you can select the test statistic parameters, the computational parameters, and the screening type.

23.4.5.1 Test Statistic

The test statistic is the first of the three test statistic parameters.

Three test statistics are available:

  • FBAT-GEE (generalized estimating equation for FBAT) If one phenotype is selected, the FBAT-GEE statistic simplifies to the standard univariate FBAT-statistic. If several phenotypes are selected, all phenotypes are tested simultaneously using FBAT-GEE.

    For FBAT-GEE:

    • Both binary and continuous phenotypes will work
    • Can combine phenotypes with different distributions (e.g. continuous and ordinal)
    • For each phenotype, an additional degree of freedom is used
    • Not as good for a large number of phenotypes

    Generally, the FBAT-GEE statistic can handle a moderate amount of any type of multivariate data, including groups of dichotomous phenotypes.

  • FBAT-PC (principal components) FBAT extension for longitudinal phenotypes, repeated measurements and correlated phenotypes.

    This method tests a weighted sum of all the measurements, with the weights determined so as to maximize the genetic component of the overall phenotypes and to minimize the phenotypic/environmental variance. Generalized principal component analysis is used to determine these weights.

    For FBAT-PC:

    • All phenotypes must have the same distribution
    • Degrees of freedom = 1 regardless of how many phenotypes are used
    • As the number of phenotypes increases the power increases
    • Quantitative phenotypes are preferable
    • Good for a large number of phenotypes
    • Can be its own type of marker “screening” test, since small genetic effects are amplified

    Generally, FBAT-PC is more powerful than FBAT-GEE if the phenotypes are correlated and quantitative.

  • FBAT-LOGRANK and FBAT-Wilcoxon are FBAT extensions of the classical logrank and Wilcoxon tests for time-to-onset data.

Check the applicable radio button.

23.4.5.2 Genetic Model

The mode of inheritance of the target/disease allele and the underlying genetic model can be selected here. Choose “Additive”, “Dominant”, “Recessive”, or “Heterozygous Advantage”. Alternatively, choose “All” to get separate outputs for all four possible models.
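
The four models differ in how a genotype is converted to a marker score. The following Python sketch shows the conventional coding by number of target alleles; it is an assumption that PBAT uses exactly these codings internally, and the function name is hypothetical.

```python
def marker_score(target_allele_count, model):
    """Code a genotype (0, 1 or 2 copies of the target allele) under a genetic model."""
    if target_allele_count not in (0, 1, 2):
        raise ValueError("target_allele_count must be 0, 1 or 2")
    codings = {
        "additive": target_allele_count,
        "dominant": 1 if target_allele_count >= 1 else 0,
        "recessive": 1 if target_allele_count == 2 else 0,
        "heterozygous advantage": 1 if target_allele_count == 1 else 0,
    }
    return codings[model]


for model in ("additive", "dominant", "recessive", "heterozygous advantage"):
    print(model, [marker_score(count, model) for count in (0, 1, 2)])
```

Choosing “All” corresponds to running the analysis once under each of these four codings.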

23.4.5.3 Null Hypothesis

Specify the applicable null hypothesis in the second drop-down. These are

  • No linkage and no association (Standard hypothesis.)
  • Linkage and no association (Use if testing in a region with known linkage.)
  • Linkage and no association (sw) (Use if testing in a region with known linkage and there are large pedigrees. The empirical variance requires estimation of the correlation between all pedigree members, which can be unstable in large pedigrees. “(sw)” stands for “sandwich variance”, which is used to provide a more robust variance estimate.)

23.4.5.4 Screening Type

Screening is useful when the phenotypes with the strongest genetic components are not known prior to the analysis and several markers have to be analyzed. The screening technique is particularly useful to handle the multiple comparison problem in genome-wide association studies. Additionally, screening can help the user to decide whether a study had sufficient power to detect a significant association. See 23.4.7.6 for how screening is output from PBAT.

Screening is an integral part of the workflow of PBAT, which, for continuous phenotypes, is called the “Conditional Mean Model”.

Two types of screening are available for continuous phenotypes. Both are based upon a genetic effect size estimate (i.e. β) which is obtained by regressing the observed offspring phenotypes on the expected offspring genotype (given the parental genotypes). The larger the genetic effect size, the larger the estimated power of the FBAT test.
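
The effect size estimate can be sketched in Python as follows. This is a minimal illustration assuming additive coding and ordinary least squares with an intercept; PBAT's actual estimation routines may differ, and the function names and data are hypothetical.

```python
from statistics import mean


def expected_offspring_count(father_count, mother_count):
    """Expected number of target alleles in the offspring under Mendelian
    transmission: each parent passes on one of its two alleles at random."""
    return (father_count + mother_count) / 2


def effect_size(phenotypes, expected_counts):
    """Least-squares slope of phenotype on expected offspring genotype."""
    x_bar, y_bar = mean(expected_counts), mean(phenotypes)
    sxx = sum((x - x_bar) ** 2 for x in expected_counts)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(expected_counts, phenotypes))
    return sxy / sxx


# Hypothetical trios: (father's target-allele count, mother's target-allele count)
parents = [(2, 1), (1, 1), (0, 1), (0, 0)]
offspring_phenotypes = [3.0, 2.0, 1.0, 0.0]

expected = [expected_offspring_count(f, m) for f, m in parents]
beta = effect_size(offspring_phenotypes, expected)
print(expected, beta)
```

Because the regression uses only the expected genotype given the parents, not the observed offspring genotype, the estimate does not use the information on which the later FBAT test is based.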

The two screening types are:

  • Screening Based on Conditional Power Calculations (parametric approach) The conditional power is the probability that the FBAT test is rejected given the offspring phenotype and the parental genotypes. Under the “Conditional Mean Model”, the genetic effect size (β) is used to obtain the expected value and the variance of the marker scores (i.e. offspring genotypes) under the alternative hypothesis, and thus to obtain the conditional power.

    In general, the conditional power test is recommended over the Wald test (see below) because the Wald test is a population-based estimate of the genetic effect size. Unlike the conditional power calculation, it does not require model assumptions under the alternative hypothesis, which is why it is called a non-parametric screening approach.

    However, since the Wald test is a purely population-based approach, it is generally less powerful than the conditional power, especially when population stratification may be present [Lange 2002c].

    Unfortunately, the conditional power method is more computationally intensive if there are very large pedigrees in the dataset. The non-parametric Wald test will run more quickly in these cases.

  • Screening Based on Non-Parametric Approach (Wald-tests) For the Wald test, the genetic effect size is directly tested (i.e., H0: β = 0).

    This method is recommended for use with continuous phenotypes that have extended pedigrees.

For other types of studies which do not use continuous phenotypes, please use Screening Based on Conditional Power Calculations, and please see the subsection on the Empirical Distribution For Phenotypes (23.4.5.8) below.

Check the radio button for the desired screening approach.

23.4.5.5 GFBAT

To adjust the FBAT statistic for environmental correlations (GFBATs), check this (GFBAT) box.

This option is only available if you have selected at least one interaction (environmental correlation) variable in the first tab.

23.4.5.6 Computational Parameters

The following options let you select the other necessary computational parameters.

23.4.5.7 Maximal Number of Non-Founders in One Pedigree

Enter the maximal number of non-founders, or siblings, in one pedigree. If a pedigree is found to have more than this number of non-founders, it will be broken up into smaller pedigrees.

NOTE: Selecting more than about seven maximal non-founders when your actual pedigrees have more than seven non-founders can become computationally intensive–very much so if you are screening by conditional power. Screening based on the non-parametric approach is therefore recommended as one possible remedy for this situation.

NOTE: If you select fewer maximal non-founders than your actual pedigrees have, your results may depend upon how your data is sorted. This is because the process of breaking up a larger pedigree into smaller pedigrees is dependent upon the order in which the larger pedigree is read in.

23.4.5.8 Empirical Distribution For Phenotypes

The main technique of using screening to filter which FBAT tests are considered uses the “Conditional Mean Model”. However, the “Conditional Mean Model” assumes continuous phenotypes are being used. Otherwise, a different method of obtaining conditional power needs to be used. This is because to obtain conditional power, the expected value/variance of the marker score under HA must be estimated.

The following distributions for phenotypes may be selected:

  • Continuous Phenotypes. (The “Conditional Mean Model” will be used for power calculations.)
  • The approach by Jiang et al (2006). (Use this for time-to-onset calculations. Also use if there is no a priori belief that association will only be observed in affected individuals.)
  • The approach by Murphy et al (2006). (Use for affected-only studies or categorical phenotypes.)
  • Naive allele freq estimator. (The allele frequencies used for screening are estimated from the parents’ genotypes.) (Alternative to the Murphy method for affected-only studies or categorical phenotypes if there is a reason to believe that the assumption the Murphy method makes about the relationship of the penetrance functions under the alternative hypothesis might be violated.)
  • Observed allele frequencies. (Another alternative to the Murphy method for affected-only studies or categorical phenotypes.)

NOTE: If you select any distribution for your phenotype other than Continuous phenotypes, your phenotype variable should either be the Affection Status or have non-missing category numbers ranging between zero and 199, inclusive.
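The category requirement in the note above can be checked before running an analysis. The following is a minimal sketch, not part of HelixTree: the function names and the set of values treated as missing are assumptions made for illustration only.

```python
def phenotype_is_valid_category(value, missing_codes=(None, "", "?")):
    """Return True if value is a non-missing whole number from 0 to 199.

    The missing_codes default is an assumption for this sketch; use
    whatever marks a missing value in your own phenotype file.
    """
    if value in missing_codes:
        return False
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    # Category numbers must be whole numbers in the inclusive range 0..199.
    return number.is_integer() and 0 <= number <= 199


def find_invalid_phenotypes(values):
    """Return (position, value) pairs for entries that fail the check."""
    return [
        (position, value)
        for position, value in enumerate(values)
        if not phenotype_is_valid_category(value)
    ]
```

For example, find_invalid_phenotypes([0, 5, 200, None]) reports the third and fourth entries, since 200 is out of range and None is missing.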

23.4.5.9 Min. Number of Informative Families

“Informative families” are those families for which power and p-value statistics could be computed.

Specify the minimum number of informative families required for the display of the FBAT-statistics. If you enter zero, statistics on all tests will be displayed.

In a typical analysis, it is not recommended to include markers with fewer than 20 informative families.
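If results were produced with a minimum of zero, the same threshold can be applied afterwards to an exported results table. The sketch below assumes each result row has been read into a dictionary keyed by column name; the "#infofam" column name follows the compact output format, while the "marker" key and the function name are hypothetical.

```python
def filter_by_informative_families(rows, min_families=20, column="#infofam"):
    """Keep only rows whose informative-family count meets the minimum.

    rows is a list of dictionaries, one per test. A min_families of zero
    keeps every row, mirroring the behavior of the dialog option.
    """
    return [row for row in rows if int(row[column]) >= min_families]


# Hypothetical example rows, for illustration only.
example_rows = [
    {"marker": "rs1", "#infofam": "35"},
    {"marker": "rs2", "#infofam": "12"},
    {"marker": "rs3", "#infofam": "20"},
]
kept = filter_by_informative_families(example_rows)
print([row["marker"] for row in kept])
```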

23.4.5.10 Maximal Iterations for GEE

Enter the maximal number of iteration steps in the GEE-estimation procedure. Enter zero to use least-squares residuals. Otherwise, GEE-residuals are computed (useful when multiple correlated phenotypes are analyzed). (This choice will be active only if you have chosen the FBAT-GEE statistic.)

23.4.5.11 Significance Level

Enter the significance level to be used for the power calculations.

Typically, .0005 might be used. However, for logrank tests, a higher level, such as .01, is preferable.

23.4.5.12 Output Format

Select Use Compact Output Format to output the shorter format that was developed for the database at the Channing Laboratories. This format is guaranteed to contain 17 columns plus a row label column for the marker names.

Uncheck Use Compact Output Format to output the original longer format. The number of columns for this format may vary, but certain statistical results are only reported in this long format.

23.4.5.13 Signed P-Values

Select Display P-Values As Signed Numbers To Show the Direction of the Main Effect to place a negative sign on the p-value when there is a negative correlation between the phenotype and the number of transmitted target/disease alleles.

Uncheck this option to display all p-values as positive numbers.

NOTE: The signed p-value is a more reliable indicator of the direction of the effect than is the heritability output, which is only an approximation to the direction of the effect.

NOTE: Signed p-values will only be displayed when the FBAT-GEE test statistic is used.
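When signed p-values are exported for further processing, the sign and the magnitude usually need to be separated again. A minimal sketch of that step follows; the function name is hypothetical and nothing here is part of HelixTree itself.

```python
def decode_signed_p_value(signed_p):
    """Split a signed p-value into (p_value, direction).

    A negative sign denotes a negative correlation between the phenotype
    and the number of transmitted target/disease alleles; the magnitude
    is the ordinary p-value.
    """
    direction = "negative" if signed_p < 0 else "positive"
    return abs(signed_p), direction
```

Always compare the magnitude, not the signed value, against a significance threshold. Otherwise every negatively signed result would appear significant.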

23.4.6 Multiple Processes


[Picture]
Figure 23.17: The Multiple Processes tab of the PBAT Analysis Options dialog.

Under this tab there are options through which you can choose to run PBAT analysis in multiple processes. This allows you to take advantage of multiple processors on a single machine by selecting Local machine, or multiple machines in a distributed environment by selecting Run on Condor® pool or Run on United Devices Grid MP. If the option Divide Job Into Multiple Processes is unchecked, PBAT will run normally on the current computer.

23.4.6.1 Running on a Local Machine

With the advent of dual-core and multiple processor systems as common desktop configurations, it is worthwhile to take full advantage of the extra CPU resources available. It may also be convenient to divide analysis into multiple jobs for the purpose of keeping memory usage low when analyzing hundreds of thousands of SNPs.

When running multiple processes on a local machine, setting Maximum number of simultaneous jobs to be less than the total number of jobs will limit the number of jobs that can run at one time. It is recommended to run only one concurrent job per processor. This avoids memory access contention, which severely impacts performance. So typically, Maximum number of simultaneous jobs should equal the number of processors and/or cores available on the machine.
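The recommendation above can be sketched in a few lines of Python. This is an illustration only, not how HelixTree divides work internally; both function names are hypothetical. Note that os.cpu_count() reports logical processors, which may exceed the number of physical cores on machines with simultaneous multithreading.

```python
import os


def recommended_simultaneous_jobs(total_jobs):
    """Suggest one concurrent job per processor, capped at total_jobs."""
    processors = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(total_jobs, processors))


def split_markers(markers, total_jobs):
    """Divide a marker list into total_jobs contiguous, near-equal chunks."""
    base, extra = divmod(len(markers), total_jobs)
    chunks = []
    start = 0
    for job in range(total_jobs):
        # The first `extra` jobs each take one additional marker.
        size = base + (1 if job < extra else 0)
        chunks.append(markers[start:start + size])
        start += size
    return chunks
```

Dividing the markers into more jobs than processors keeps the memory used by each job low, while the simultaneous-jobs limit keeps the processors from being oversubscribed.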

23.4.6.2 Run on Condor® pool

Condor® is a freely available, specialized batch system for managing compute-intensive jobs in a distributed network environment. Condor® and its extensive user manuals can be found at http://www.cs.wisc.edu/condor/. As Condor® is cross-platform, you can easily set up a Condor pool on your Windows, Linux or Mac OS X based systems and take advantage of a distributed computing environment with PBAT Family-Based Analysis.

To run multiple jobs through Condor®, select the Run on Condor pool option and browse to the location of the bin folder inside the directory where Condor® was installed on the system. Click Test to have HelixTree check that Condor® is configured and connected to a central manager.

It may be advantageous to specify the creation of more jobs than the number of machines available in the Condor® pool. Condor will properly queue jobs and even out the effect of slower and faster computers taking longer or shorter times on each job.
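The benefit of creating more jobs than machines can be illustrated with a small simulation. This is a simplified model under stated assumptions, not a description of Condor's actual scheduler: the work is split into equal-sized jobs, each machine has a fixed relative speed, and whichever machine becomes free first takes the next queued job. The function name and the numbers are hypothetical.

```python
import heapq


def total_completion_time(total_work, n_jobs, machine_speeds):
    """Simulate a simple job queue and return when the last job finishes."""
    job_size = total_work / n_jobs
    # Heap of (time at which the machine becomes free, machine index).
    free_at = [(0.0, index) for index in range(len(machine_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_jobs):
        start, index = heapq.heappop(free_at)
        done = start + job_size / machine_speeds[index]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, index))
    return finish


# One machine is twice as fast as the other.
print(total_completion_time(120, 2, [1.0, 2.0]))   # one job per machine
print(total_completion_time(120, 12, [1.0, 2.0]))  # many small jobs
```

In this model, two jobs on two machines finish after 60 time units because the slow machine holds up the result, whereas twelve smaller jobs finish after 40 time units because the fast machine picks up more of them.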

For instructions on how to install Condor® on your network, see Appendix B.

23.4.6.3 Run on United Devices Grid MP™

United Devices Grid MP™ (http://www.ud.com/products/gridmp.php) is a commercial platform for distributed computing in the enterprise environment. HelixTree can create and submit jobs to the Grid MP™ platform, display the progress of running jobs previously submitted, and retrieve the results back into HelixTree.

To be able to run jobs on the Grid MP™ platform, you must specify the name of the RPC and File Servers. In most environments these are one and the same. To specify the server, select Run on United Devices Grid MP and enter the fully qualified server name into the Grid MP Platform Server Name field. You will also need to receive login credentials from your Grid MP™ administrator and enter them in the Advanced Options (see below).

The Job Description field allows you to uniquely identify the job representing the analysis to be performed. This field will be the job description of the job as it is viewable on the Grid MP™ Platform through the web-based management console. It will also be the name of the results spreadsheet once the job is completed and the results are imported. You can also identify a job by its unique Job ID, which will be displayed in the progress dialog of running jobs as well as in the log of the progress node for backgrounded jobs. (See 23.4.6.5.)

To test that you have sufficiently configured the Grid MP™ Platform to be able to create and delete Jobs and Data Sets, click the Test option in the United Devices area of the tab.

23.4.6.4 United Devices Advanced Options


[Picture]
Figure 23.18: Advanced Options for the United Devices Grid MP™.

Advanced options for the United Devices Grid MP™ platform are accessible by clicking Advanced Options in the United Devices area of the Multiple Processes tab.

The User Name and Password fields should be filled in with the login credentials that your Grid MP™ administrator provides you.

To be able to run PBAT HelixTree on your Grid MP™ network, the administrator must first install the PBAT Program Module. Please contact support@goldenhelix.com to receive the latest version of this module as well as instructions for installing it on the grid. The instructions encourage naming the Application and Program installed on the grid as PBAT HelixTree. If the administrator chooses to use a different name for either the Application or Program, that name must be entered in this dialog in the corresponding field.

The Errors per Workunit and Concurrent Dispatches per Workunit options are parameters for running jobs on the grid that deal with redundancy and error handling. The value of 5 is a common default, but your administrator may prefer to use other values.

If your grid infrastructure is set up so that your RPC server is different than your file server, you can check the Use Custom URLs box to allow for more precise control over where these two services are located on your network.

Finally, for debugging purposes it is sometimes useful to not delete the Job and Data Set entities on the Grid MP server after a job has been completed. If this is the desired behavior, check the Don’t delete Job and Data Set after successful analysis box.

23.4.6.5 United Devices Backgrounded Jobs

Because the type of PBAT Analysis that is a good candidate for being distributed will often still take a considerable amount of time to complete, HelixTree allows you to move the progress of a United Devices distributed job to the background, and come back to it later.

Once a job has started, the progress dialog will have a Background button. If you click on this button, the progress dialog will close and a node will be created as a child of the current pedigree spreadsheet. The current open project can be saved, closed and reopened at another time and the node will remain available to access the progress of the backgrounded job.

Opening the progress node (double clicking on it) will display the progress dialog for the corresponding job. It may take a few seconds for the dialog to update to the current status of the distributed job. If the job is complete, the results will be immediately imported and the progress node will be replaced with the resulting imported data.

To cancel the United Devices job, delete the progress node. This action will prompt you for confirmation, indicating that deleting the node will delete the job from the grid, cancel current running jobs and discard any completed results.

23.4.7 Output Spreadsheet

When all of the parameters are set, click Run Analysis to begin the analysis. A progress bar will show. The analysis may be stopped by pressing Cancel on the progress bar.

(Alternatively, click Save Options to just save the currently-set options.)

If the PBAT analysis finishes normally, and results were obtained using the selected parameters, a results spreadsheet will be created and displayed. If no test has enough informative families for display, no output spreadsheet will be created.

23.4.7.1 Compact Format

This shorter format was developed for the database at the Channing Laboratories. It is guaranteed to contain 17 columns plus a row label column for the marker names (or marker combinations for haplotype analysis). These columns are:

  • Groupname This is the grouping variable, if grouping is used.
  • Group This is the group variable value, if grouping is used. Otherwise, “-999” will be shown.
  • Allele The allele or haplotype tested.
  • Freq Allele or allele combination frequency.
  • HWE P-value of the Hardy-Weinberg test for the parents.
  • phenos Phenotype or phenotypes used.
  • cov Covariate or covariates used, if any.
  • inter Interaction variable or variables used, if any.
  • model The genetic model for this test. 0 = additive, 1 = dominant, 2 = recessive, 3 = heterozygous advantage.
  • test Statistical test used. 1 = FBAT-GEE, 2 = FBAT-PC, 3 = FBAT-LOGRANK, 4 = FBAT-Wilcoxon, 5 = optimal FBAT-LOGRANK (naive weights).
  • #infofam The number of families that were informative for this test.
  • pvalue P-value for the FBAT statistic. This is for the main genetic effect, if this test has been done with an interaction term.

    NOTE: If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.

    NOTE: If you have specified Display P-Values As Signed Numbers To Show the Direction of the Main Effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.

  • power Conditional power estimate, if screening with conditional power has been selected.
  • wald The result of the Wald test. Will be meaningful only if the conditional mean model would have been meaningful for this test.
  • herit The heritability of this trait. The heritability is defined as the proportion of phenotypic variance explained by the analyzed marker. A negative sign denotes a negative correlation between the phenotype and the number of transmitted target/disease alleles.
  • FBATI Joint p-value for the main effect and the interaction term, if this test has been done with an interaction term.
  • powerFBATI Power for the FBAT interaction statistic, if this test has been done with an interaction term, and screening with conditional power has been selected.

NOTE: See 23.4.7.6 concerning output for screening tests vs. output for FBAT tests.
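Because the compact format stores the genetic model and the test as numeric codes, exported results are easier to read once the codes are translated. The sketch below uses the code tables listed above; the function name and the dictionary-per-row representation are assumptions for illustration.

```python
GENETIC_MODELS = {
    0: "additive",
    1: "dominant",
    2: "recessive",
    3: "heterozygous advantage",
}

TEST_NAMES = {
    1: "FBAT-GEE",
    2: "FBAT-PC",
    3: "FBAT-LOGRANK",
    4: "FBAT-Wilcoxon",
    5: "optimal FBAT-LOGRANK (naive weights)",
}


def describe_compact_row(row):
    """Translate the coded columns of one compact-format row.

    row is a dictionary keyed by the compact-format column names.
    A Group value of -999 means grouping was not used, reported as None.
    """
    model_code = int(row["model"])
    test_code = int(row["test"])
    group = str(row["Group"]).strip()
    return {
        "group": None if group == "-999" else group,
        "model": GENETIC_MODELS.get(model_code, f"unknown ({model_code})"),
        "test": TEST_NAMES.get(test_code, f"unknown ({test_code})"),
    }
```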

23.4.7.2 Normal Expanded Format

The normal expanded-format output will have a varying number of columns, depending upon the parameters selected and how many phenotypes are in the phenotype spreadsheet. Since a column will be present for every possible phenotype, the spreadsheet may be quite wide. However, all output statistics are visible in this format.

See 23.4.8 for the time-to-onset analysis output fields in the expanded format. Otherwise, the output spreadsheet columns in the expanded format may be divided into several categories:

  • Row header with marker information
  • Subgroup designation
  • Allele information and genetic model
  • P-value and power values
  • A column for every phenotype (including the Affection Status)
  • Extra columns for powers of predictor phenotypes, if necessary
  • Heritability
  • Extra columns relating to interactions, if necessary

NOTE: See 23.4.4.5 for the extra outputs added when you perform MFBAT testing.

23.4.7.3 Marker information

For SNP analysis, the marker (SNP) name labels the row. For haplotype analysis, the involved markers (SNPs) separated by periods label the row.

23.4.7.4 Subgroup designation

If you have defined sub-groups of the population, the subgroup to which the analysis was restricted is shown in the first column. The value “-999” in the first column means that the total sample was analyzed.

23.4.7.5 Marker and allele information and genetic model

Next, for SNP analysis, the number of the allele being tested is shown, followed by the following information:

  • Allele frequency overall
  • Hardy-Weinberg p-value overall
  • Allele frequency for the parents
  • Hardy-Weinberg for the parents

For haplotype analysis, the outputs are instead:

  • the haplotype (the respective alleles separated by colons)
  • the haplotype frequency

These are followed (for both SNP and haplotype analysis) by a column for the genetic model. The additive model is labeled as “0”, the dominant model as “1”, the recessive model as “2”, and the heterozygous advantage model as “3”.

If you selected “All” genetic models, the analysis will have been run not only for each marker and allele, but also for each model. In this case, an entry in this column will show which genetic model was used for that row’s analysis.

The number of informative families for this marker and allele (or for this haplotype) is shown after the genetic model.

23.4.7.6 Output from PBAT Analysis

The main screening technique for filtering which FBAT tests are considered uses the “Conditional Mean Model”.

In PBAT, the screening results are output into the same spreadsheet as the results from the actual FBAT tests. This allows sorting by the screening (power) results, and selecting only those results which have the most significant power. The FBAT tests which are contained in these same spreadsheet rows (indicating the tests with the most power) may be considered as if they had been done separately from the other FBAT tests, and the multiple-test correction applied only to these FBAT tests. This may be done because the screening tests are independent of the offspring genotype component of the FBAT tests themselves. Both the screening tests and the FBAT tests are conditioned on the same known quantities, namely the parental genotypes and the offspring phenotype(s).
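The two-stage procedure described above can be sketched as follows for an exported results table. This is an illustration under stated assumptions: the "power" and "pvalue" column names follow the compact format, the "marker" key is hypothetical, and a Bonferroni correction at an overall level of 0.05 is assumed here as one possible multiple-test correction. The text above does not prescribe a particular correction.

```python
def screen_then_test(rows, top_k, alpha=0.05, power_key="power", p_key="pvalue"):
    """Rank tests by conditional power, then correct only the top_k tests.

    Returns (row, is_significant) pairs for the selected rows. The
    magnitude of the p-value is used so that signed p-values also work.
    """
    ranked = sorted(rows, key=lambda row: float(row[power_key]), reverse=True)
    selected = ranked[:top_k]
    if not selected:
        return []
    # Assumed Bonferroni correction over the selected tests only.
    threshold = alpha / len(selected)
    return [(row, abs(float(row[p_key])) <= threshold) for row in selected]
```

With four tests and top_k set to 2, each selected p-value is compared against 0.05 / 2 rather than 0.05 / 4, which is where the gain from screening comes from.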

Also output are results from the Wald (non-parametric) tests. See 23.4.5.4 for a description of this type of test.

NOTE: If the phenotypes are not continuous, an empirical distribution for the phenotypes other than “Continuous Phenotypes” should have been selected. (See 23.4.5.8.) The Wald test is not valid for non-continuous phenotypes.

23.4.7.7 P-value and power values

After the marker and allele information and the genetic model are shown, the following statistical outputs are shown:

  • pvalue(FBAT) P-value for the FBAT statistic. This is for the main genetic effect, if this test has been done with an interaction term.

    NOTE: If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.

    NOTE: If you have specified Display P-Values As Signed Numbers To Show the Direction of the Main Effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.

  • pvalue(FBATI) Joint p-value for the main effect and the interaction term, if this test has been done with an interaction term.
  • power(FBAT) Conditional power estimate, if screening with conditional power has been selected.
  • power(FBATI) Power for the FBAT interaction statistic, if this test has been done with an interaction term, and screening with conditional power has been selected.
  • pvalue(Wald) p-value of the overall Wald-test for a genetic effect in the conditional mean model. Will be meaningful only if the conditional mean model would have been meaningful for this test.
  • pvalue(WaldI) p-value of the overall Wald-test for a gene/covariate interaction in the conditional mean model. Will be meaningful only if the conditional mean model would have been meaningful for this test.

23.4.7.8 A column for every phenotype (including the Affection Status)

A column for every phenotype (including the Affection Status) is shown. The following notation is used:

  • “0”: Not selected, or selected but not used in the analysis output in this row.
  • “1”: Selected as a phenotype/trait and tested for association with FBATs in this row’s results.
  • “P”: Selected and used as a covariate/predictor variable. “P” is then followed by groups of two indicator variables. Each group corresponds to one selected phenotype/trait in the order that they are listed in the table. The first indicator variable in each group is “1” if the predictor variable/covariate is significant at 5%-level in the conditional mean model for the corresponding phenotype/trait. Otherwise, it is 0. The second indicator variable in each group is “1” if the predictor variable/covariate is significant at 1% in the conditional mean model for the corresponding phenotype/trait. Otherwise, it is 0.
  • “I”: Selected and used as an interaction variable in this row.
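The “P” notation can be unpacked programmatically when post-processing an exported spreadsheet. The sketch below assumes that the indicator variables appear as “0” and “1” characters after the “P”, in the order of the selected traits, and it ignores any other separator characters. The exact cell layout should be checked against your own output; the function name and the trait names are hypothetical.

```python
def parse_predictor_cell(cell, trait_names):
    """Decode a predictor cell such as "P1100" into per-trait flags.

    Each trait gets two indicators: significance of the predictor at the
    5% level and at the 1% level in the conditional mean model.
    """
    text = cell.strip()
    if not text.startswith("P"):
        raise ValueError("not a predictor cell: " + repr(cell))
    # Keep only indicator characters; separators, if any, are skipped.
    flags = [character == "1" for character in text[1:] if character in "01"]
    if len(flags) != 2 * len(trait_names):
        raise ValueError("expected two indicators per selected trait")
    return {
        trait: {
            "significant_at_5_percent": flags[2 * index],
            "significant_at_1_percent": flags[2 * index + 1],
        }
        for index, trait in enumerate(trait_names)
    }
```

For two selected traits, a cell of "P1100" would then read as: significant at both levels for the first trait, and at neither level for the second.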

23.4.7.9 Extra columns for powers of predictor phenotypes, if necessary

If you used predictor variables of powers greater than one, extra columns show for the higher-power phenotypes. They will either show the display starting with “P” (as explained above) or show “0”.

23.4.7.10 Heritability

The heritability of the selected phenotype(s) will appear here.

The heritability is defined as the proportion of phenotypic variance explained by the analyzed marker. A negative sign denotes a negative correlation between the phenotype and the number of transmitted target/disease alleles.

If you selected more than one phenotype, and you also asked for a maximum of more than one phenotype in a group, one column corresponding to each selected phenotype will appear here, and display the heritability whenever the phenotype was involved in the calculations. (“0” will appear for uninvolved phenotypes.)

23.4.7.11 Extra columns relating to interactions, if necessary

If you have selected one or more interaction variables, additional columns will appear in this position. They are:

  • main effect An estimate of the regression coefficient for the main effect.
  • Std error Standard error for the main effect coefficient.
  • p-value P-value for the main effect coefficient.
  • interaction An estimate of the regression coefficient for the interaction term.
  • Std error Standard error for the interaction term coefficient.
  • p-value P-value for the interaction term coefficient.
  • FBAT-I The FBAT statistic p-value for the interaction term coefficient (analogous to the above p-value for the interaction term, and should have a similar value).
  • h-main The heritability of the main effect.
  • h-interaction The heritability of the interaction.

23.4.8 Output Spreadsheet for Time-To-Onset Analysis

For time-to-onset analysis, the outputs are somewhat different. This output may be divided into the following categories:

  • Row header with marker information
  • Subgroup designation
  • Allele information and genetic model
  • P-value and power values

NOTE: See 23.4.4.5 for the extra outputs added when you perform MFBAT testing.

23.4.8.1 Row header, subgroup, allele, and genetic model information

These are the same as for the non-time-based analysis (expanded format) output described above (23.4.7.2).

23.4.8.2 Time-to-onset p-values and powers

The following statistical outputs are shown for time-to-onset analyses:

  • FBAT-Wilcoxon P-value for the FBAT-Wilcoxon statistic.
  • power Power for the FBAT-Wilcoxon statistic.
  • FBAT-LOGRANK P-value for the FBAT-LOGRANK statistic.
  • power Power for the FBAT-LOGRANK statistic.
  • optimal FBAT-LOGRANK (FH-weights) P-value for the optimal FBAT-LOGRANK statistic (with FH-weights).
  • power Power for the optimal FBAT-LOGRANK statistic (with FH-weights).
  • optimal FBAT-LOGRANK (naive weights) P-value for the optimal FBAT-LOGRANK statistic (with naive weights).
  • power Power for the optimal FBAT-LOGRANK statistic (with naive weights).