Setting Options for Tree Analysis


[Picture]
Figure 7.2: The Tree->Options menu Tree tab option

The menu choice Tree->Options opens a three-tabbed dialog that lets you control how calculations are performed and which values are displayed in the individual nodes.

7.2.1 The Tree Tab

The Tree tab allows you to set parameters that affect how values are calculated. These parameters control how nodes are split, how calculations are affected by missing values, and so on.

7.2.1.1 Minimum Elements per Child:

This feature counteracts the tendency of recursive partitioning to pull off small, possibly outlier, groups. When this is set greater than one, splits are computed so that every child node has at least that minimum number of observations.
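As an illustration only, the following Python sketch lists the cut positions that remain available for a two-way split once a minimum child size is imposed. The function name and the sorted-position view of the data are assumptions made for this example and are not part of HelixTree; ties in the predictor are ignored.

```python
def admissible_cut_points(n_obs: int, min_per_child: int) -> list[int]:
    """Positions i where sorted observations may be cut into [0:i] and [i:n_obs]
    so that both children hold at least min_per_child observations."""
    return [
        i
        for i in range(1, n_obs)
        if i >= min_per_child and n_obs - i >= min_per_child
    ]


# With 10 observations and a minimum of 4 per child, only the middle cuts survive.
print(admissible_cut_points(10, 1))
print(admissible_cut_points(10, 4))
```

Raising the minimum shrinks the set of candidate cuts from both ends, which is how small outlier groups are prevented from being split off.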

7.2.1.2 Segmenting Algorithm:

This can be either an exact O(n^2) algorithm or an approximate O(n^1.5) algorithm. For multi-way splits, computing the segmentation can be time consuming for large sample sizes. The approximate algorithm may occasionally give a slightly sub-optimal split, but because this is rare in practice, it is the default.

7.2.1.3 Max Segments:

This is the maximum cardinality of a multi-way split, with a default of 10. There is no limit to the cardinality you may specify for a split. However, most data sets don’t support 50-way splits, so it is often advisable to save computing time and not search through high cardinality splits. The exact and approximate algorithms run in O(kn^2) and O(kn^1.5) time respectively, where k is the maximum split cardinality.
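To get a feel for these growth rates, the sketch below compares the two cost expressions for a hypothetical sample size. The function name is an assumption for this example, and the numbers are unitless growth terms, not measured HelixTree run times.

```python
def relative_cost(n: int, k: int, exponent: float) -> float:
    """Growth term k * n**exponent (2.0 for exact, 1.5 for approximate)."""
    return k * n ** exponent


n = 10_000  # hypothetical number of observations
for k in (2, 10, 50):
    exact = relative_cost(n, k, 2.0)
    approx = relative_cost(n, k, 1.5)
    print(f"k={k:2d}  exact/approximate ratio = {exact / approx:.0f}")
```

The ratio between the two terms is the square root of n regardless of k, while both terms grow linearly in k. This is why lowering Max Segments saves time under either algorithm.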

7.2.1.4 Parallel Threads:

This setting allows the specification of the number of concurrently running threads for calculating splits on multiprocessor machines. If you have a multiple processor machine, this can markedly improve performance on very large data sets.

7.2.1.5 Resampling Iterations

This defines the number of iterations used to estimate p-values using the resampling approach. The smallest bound you can place on a p-value is 1 divided by the number of resampling iterations.
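The bound can be illustrated with a generic permutation test. This is a minimal sketch under stated assumptions: the two-group mean-difference statistic, the function name, and the convention of reporting at least 1/iterations are choices made for this example and do not describe HelixTree's internal resampling procedure.

```python
import random


def permutation_p_value(group_a, group_b, iterations, seed=0):
    """Estimate a p-value for the difference in group means by reshuffling labels."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Assumed convention: never report less than 1 / iterations.
    return max(hits, 1) / iterations


p = permutation_p_value([1, 2, 3, 4], [11, 12, 13, 14], iterations=200)
print(p)
```

With 200 iterations the estimate can never fall below 0.005, so resolving smaller p-values requires more iterations.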

7.2.1.6 P Value Threshold:

The radio button for selecting RawP, AdjustedP or Bonferroni adjustedP determines which type of p-value is compared against the threshold. Any split whose p-value is greater than the threshold is disallowed, and the manual split window uses the same p-value to decide when to call a split not significant.

In multiway splitting, the Multiway split pairwise P value threshold is also used as the significance level for pairwise comparisons between adjacent nodes. If any pair of adjacent nodes in a multiway split is not significantly different according to this threshold, the cardinality of the split is reduced.

By default, the Bonferroni adjusted p-value is used as the split stopping criterion, which has the lowest chance of over-fitting. However, when doing variable selection, it may be advisable to use the raw or adjusted p-value as the stopping criterion, because the Bonferroni adjusted p-value may be overly conservative for this purpose when there are large numbers of predictors.

NOTE: For best prediction performance on multiple trees with the Bonferroni p-value threshold, larger p-value thresholds are better, even as high as 0.99. However, if you want to make the threshold lax, set Max Segments to 2 or 3. Using RawP or AdjustedP as the splitting criterion allows trees to be over-built with a less lax p-value threshold than is needed with the Bonferroni threshold.

7.2.1.7 Use Missing Values as Predictors (Impute Missing Data for HTR)

By default this is turned ON and uses missing values to predict the response. In regular splits, this corresponds to using a missing class for double and integer predictors. When this option is turned OFF, missing values are dropped for double and integer predictors. Hence a split may have daughter nodes whose total number of observations does not add up to the parent’s total.

It is not advisable to drop missing values for the multiple tree variable correlation plot, as it will skew the proportions in the trees. For linear regression, this setting has no effect as missing values are never used in the regression, and the value of the parent mean is used as the prediction for those observations that are missing.

For the Haplotype Trend Regression (HTR), the default setting (checked) imputes the haplotype probabilities even for the patients with missing values for genetic markers (whether by EM or CHM).

When this is unchecked, individuals with missing values for their genotypes are excluded from the regression. You may wish to uncheck this box when values are not missing at random, for instance where there are more missing values for controls than for cases.

7.2.1.8 Linear/Logistic Regression:

This tells whether to use linear regression fits for continuous and ordinal predictors on continuous and ordinal responses, or whether to use logistic regression fits for continuous and ordinal predictors on a binary response (or a categorical response with just two categories).

With linear regression, a line fits the response in terms of a single predictor, and a p-value is computed for goodness of fit. Instead of splitting the data, a single node is dropped beneath the parent, containing the residual of the fit. If there are missing values in the predictor, the response will be fit as the mean of the parent node for those observations.
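The residual node described above can be sketched as follows. This is an illustrative least-squares fit only: the function name and the use of None for a missing predictor value are assumptions for this example, and p-value computation is omitted.

```python
def residual_node(x, y):
    """Fit y on x by least squares and return the residuals for the child node.

    Observations whose predictor is missing (None) are fitted with the parent
    node's mean response, as described in the text.
    """
    observed = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    parent_mean = sum(y) / len(y)  # mean over all observations in the parent
    residuals = []
    for xi, yi in zip(x, y):
        fitted = parent_mean if xi is None else intercept + slope * xi
        residuals.append(yi - fitted)
    return residuals


print(residual_node([1, 2, 3, None], [2, 4, 6, 10]))
```

In this toy data the three complete observations lie exactly on a line, so their residuals are zero, while the observation with the missing predictor keeps its deviation from the parent mean.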

With logistic regression, a logistic (sigmoid) curve fits the binary response in terms of a single continuous or ordinal predictor. A p-value is computed for goodness of fit. Instead of splitting the data, a single node is dropped beneath the parent, containing the residual of the fit. Beware that the residual is continuous, so subsequent splits are on a (usually bimodal) continuous distribution, which is not always amenable to hypothesis testing based on normality assumptions.

7.2.1.9 Non-Genetic Splits

Sometimes it is desirable to exclude non-genetic variables as predictors in a tree. Turn this flag off to exclude splits of continuous, binary, ordinal, and categorical predictors.

7.2.1.10 Genotype Split

Sometimes it is desirable to exclude genetic variables as splitters in the tree. Turn this flag off to exclude splits of genetic variables. It does not, however, exclude haplotype trend regression.

7.2.1.11 Haplotype Trend Regression

This tells whether to use Haplotype Trend Regression to fit a moving window of haplotypes (derived from a moving window of genetic markers as predictors) to the response using linear regression for continuous and ordinal outcomes, or logistic regression for binary (or categorical with two category) outcomes. The regression matrix is composed of estimates of haplotype probabilities across a set of possible haplotypes, plus a constant term. Like linear regression, Haplotype Trend Regression will create a single node beneath the node being “split” that contains the residual of the fit.

A window size of 1 tests only a single genetic marker, and is an allelic test. With a window size of 2 or more, the regression matrix includes frequencies of haplotypes with the same number of markers as the window size. A window size of 3 will use each marker and the subsequent two active markers, i.e. the next two active markers to the right in the spreadsheet view. Markers that are inactivated in the spreadsheet are skipped.
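The way the moving window skips inactive markers can be sketched as below. The marker names and the function are hypothetical, and the sketch assumes that only full-size windows are produced. How HelixTree treats the last few markers, which have fewer successors than the window size, is not stated here.

```python
def marker_windows(markers, active, window_size):
    """Return each full window of consecutive active markers, in spreadsheet order."""
    active_markers = [m for m, is_active in zip(markers, active) if is_active]
    last_start = len(active_markers) - window_size
    return [tuple(active_markers[i:i + window_size]) for i in range(last_start + 1)]


markers = ["rs1", "rs2", "rs3", "rs4", "rs5"]  # hypothetical marker columns
active = [True, True, False, True, True]       # rs3 has been inactivated
for window in marker_windows(markers, active, window_size=3):
    print(window)
```

Because rs3 is inactive, the first window joins rs1 and rs2 directly with rs4.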

Please see Chapter 17 (Haplotype Regression and the Allele Table) for a more thorough description of Haplotype Trend Regression.

7.2.1.12 Fixed Window Size

This can only be unchecked if a marker map has been imported from the spreadsheet file menu. This specifies that a fixed number of (activated) markers, which may be specified below in Marker Window Size, should be used for the moving window.

Note that if columns are deactivated in the spreadsheet, the moving window will skip over the deactivated ones. Hence if you want to look at a particular haplotype of several markers that are not in consecutive columns of the spreadsheet, you can deactivate the intervening ones and build a haplotype composed only of the markers that are active.

7.2.1.13 Ignore Marker Mapping

This option, which only applies when Fixed Window Size is checked, can only be unchecked if a marker map has been applied to the spreadsheet. Checking this specifies that the markers should be used for Haplotype Trend Regression in the order that they appear in the spreadsheet. (Note that in HelixTree versions 3.1.0 and greater, the spreadsheet created from applying a marker map will be sorted in the same order as the marker map.) Unchecking this specifies that the markers should be used in the order specified by the marker map, and that the moving window will not cross from one chromosome to another as specified by the marker map.

7.2.1.14 Window Size Using Genetic Distance of ____ Units

This can only be selected if a marker map has been applied to the spreadsheet. Importing a marker map activates the choice of a fixed window size, or a window size using genetic distance of xxx units, where the number of units is specified in the units of the input marker map. The genetic distance specified will be considered as a maximum. The window used will not cross from one chromosome to another as specified by the marker map.

7.2.1.15 Marker Window Size

This specifies the number of markers in the fixed window, or the maximum number of markers to be used when genetic distance has been used to specify the moving window.
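The interaction between the genetic-distance limit, the chromosome boundary, and Marker Window Size can be sketched as follows. The Marker record, the function name, and the choice to start one window at every marker are assumptions for this illustration. Markers are assumed to be listed in map order.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    name: str
    chromosome: str
    position: float  # in the units of the marker map


def distance_windows(markers, max_distance, max_markers):
    """Build one window per starting marker, bounded by distance, count and chromosome."""
    windows = []
    for i, start in enumerate(markers):
        window = [start.name]
        for nxt in markers[i + 1:]:
            if len(window) >= max_markers:
                break  # Marker Window Size reached
            if nxt.chromosome != start.chromosome:
                break  # never cross onto another chromosome
            if nxt.position - start.position > max_distance:
                break  # genetic distance is a maximum
            window.append(nxt.name)
        windows.append(tuple(window))
    return windows


marker_map = [
    Marker("A", "1", 0.0), Marker("B", "1", 1.0), Marker("C", "1", 5.0),
    Marker("D", "2", 0.5), Marker("E", "2", 1.0),
]
print(distance_windows(marker_map, max_distance=2.0, max_markers=3))
```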

7.2.1.16 Minimum Frequency

This allows you to exclude haplotypes whose frequency is low. This is desirable when you have multiallelic markers, or are using a wide window size and many haplotypes are not represented in the sample, hence creating the possibility of a rank-deficient regression matrix. All haplotypes below the specified threshold will be binned together into a combined group. The default threshold is 0.01.

NOTE: Occasionally it will appear as if HelixTree has skipped over a set of markers with the haplotype regression. If this happens, it is because the regression matrix was rank deficient and the regression could not be solved with the covariate matrix defined by the haplotype probabilities. This happens, for instance, if two columns are linear multiples of each other – something that happens with higher probability when there are a number of columns that are filled almost completely with zeros. By raising the Minimum Frequency threshold high enough, this problem will tend to go away. The downside of raising this threshold is that low probability haplotypes that may be highly predictive of the response get their probability lumped in with the rare haplotypes column, and their signal may be missed.
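The binning of rare haplotypes can be pictured with the short sketch below. The haplotype labels, the "rare" key, and the function name are hypothetical. HelixTree applies this to columns of the regression matrix, whereas the sketch works on a simple table of frequencies.

```python
def bin_rare_haplotypes(freqs: dict[str, float], min_frequency: float = 0.01) -> dict[str, float]:
    """Keep haplotypes at or above the threshold and pool the rest into one group."""
    kept = {hap: f for hap, f in freqs.items() if f >= min_frequency}
    rare_total = sum(f for f in freqs.values() if f < min_frequency)
    if rare_total > 0:
        kept["rare"] = rare_total  # combined group for all low-frequency haplotypes
    return kept


estimated = {"AB": 0.600, "Ab": 0.385, "aB": 0.008, "ab": 0.007}  # made-up frequencies
print(bin_rare_haplotypes(estimated))
```

Pooling replaces two nearly empty columns with a single column, which reduces the chance of a rank-deficient regression matrix at the cost of hiding any signal carried by an individual rare haplotype.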

7.2.1.17 Estimate Frequencies Using CHM or EM

This allows you to select how haplotype frequencies are estimated when performing the haplotype trend regression. Either the Composite Haplotype Method (CHM) or the Expectation/Maximization (EM) algorithm may be specified. EM is the default method. EM usually narrows down the set of haplotypes expected to occur at appreciable frequency more than CHM does, and thus simplifies the regression calculations and results. CHM is a quick enumeration procedure for haplotype probabilities that does not assume Hardy-Weinberg equilibrium (HWE), whereas EM assumes HWE. If the HWE assumption fails, CHM-based haplotype trend regression may outperform EM-based haplotype trend regression.

7.2.2 The Node View Tab


[Picture]
Figure 7.3: The Tree->Options menu Node View tab option

This tab allows you to turn ON or OFF the statistics and p-values displayed in the tree nodes. To turn ON a value’s display, select it using its check box.

7.2.2.1 What the Node Values Mean


[Picture]
Figure 7.4: A picture of a node with many of its values turned ON

The following table explains the meaning of the node symbols:


Symbol | Meaning
BP_I | Response Variable Name (only shows in root node)
n | Node sample size
u | Node mean. (For multivariate display, the means of each of the dependent variables will be listed as u1, u2, and so forth.)
s | Node standard deviation. (For multivariate display, the standard deviations of each of the dependent variables will be listed as s1, s2, and so forth.)
se | Node standard error. (The multivariate display will list standard error for each dependent variable.)
mse | Node mean squared error. (The multivariate display will list mse for each dependent variable.)
P | The p-value calculated as a T-test, F-test, or Chi-squared test, as appropriate. Regression p-values are computed using a log-likelihood ratio test. This field is only used in “parent” or non-terminal nodes.
aP | The p-value multiplicity adjusted for the number of possible cut-points searched through in the case of continuous, ordinal, or categorical predictors. For binary predictors and regression fits, aP=P as there is no multiple testing. This p-value is not adjusted for the multiplicity of independent variables searched through to find the split.
bP | The Bonferroni-adjusted p-value of the split. This is the adjusted p-value (aP) multiplied by a Bonferroni correction, which is equal to the number of independent variables that were searched through and that actually could be split. Independent variables used for regression will be counted in addition to those used for segmenting. If an independent variable is used successfully for both regression and segmenting, it will be counted twice. Note that if a given independent variable cannot serve as a splitter because all instances are identical, then that variable will not contribute to the Bonferroni correction. For this reason, the correction factor will either stay the same or get smaller as you go deeper into the tree.
rsP | Resampled p-value. This is a p-value calculated by a statistical resampling methodology. This p-value only appears if the user has calculated it using the Resample node menu item.
RsbP | Bonferroni-adjusted resampled p-value. This is rsP multiplied by the same Bonferroni correction that is applied to aP to get bP.
Nn | This is the node identifier. The root node is denoted as “N”. If there were 3 children of the root, they would be denoted N1, N2, and N3. Children of N2 would be denoted N21, N22, and so forth.
I | At the bottom right of a node is a little “I”, which if clicked will pop up a text editor to annotate the node. If the tree is saved to disk, the annotations are recovered when the tree is next loaded. Nodes that have annotations will display the “I” in pink, otherwise in black.
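The relationship between aP and bP can be sketched as below. This is a simplified illustration: the column names and functions are hypothetical, the cap at 1.0 is an assumption, and the extra counting of variables used for regression is left out.

```python
def count_splittable(columns: dict[str, list]) -> int:
    """Count predictors that could be split, i.e. whose values are not all identical."""
    return sum(1 for values in columns.values() if len(set(values)) > 1)


def bonferroni_p(adjusted_p: float, columns: dict[str, list]) -> float:
    """Multiply aP by the number of splittable predictors (capped at 1.0 by assumption)."""
    return min(1.0, adjusted_p * count_splittable(columns))


node_data = {
    "age": [30, 41, 52],
    "sex": ["F", "F", "F"],  # constant within this node, so it does not count
    "dose": [1, 2, 2],
}
print(count_splittable(node_data))
print(bonferroni_p(0.004, node_data))
```

Deeper nodes hold fewer observations, so more predictors become constant within them. This is why the correction factor can only stay the same or shrink down the tree.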

7.2.3 The Linkage Disequilibrium Parameters Tab


[Picture]
Figure 7.5: The Tree->Options menu LD Parms tab option

7.2.3.1 Linkage Disequilibrium CHM vs. EM

The first radio button, Estimate Haplotype Frequencies Using Composite Haplotype Method, specifies that the pairwise marker haplotype frequencies are computed with the Composite Haplotype Method (CHM) approach. CHM is a one-pass enumeration approach that does not assume Hardy-Weinberg equilibrium. More detail on this can be found in Chapter 14.4 in the section on LD Computation.

The second radio button, Estimate Haplotype Frequencies Using Expectation/Maximization, specifies that the haplotype frequencies are computed with the EM approach. EM is an iterative approach based on maximizing a likelihood function. The EM algorithm is described in Chapter 19, Haplotype Frequency Estimation (EM Algorithm).

The Only Output R Squared and D Prime (Quick Mode) check box allows faster LD computation for larger data sets. When this box is checked, computations for P values (and square roots of the computed R^2 values) are not done, and those results are not stored in memory. The speed difference will be most noticeable for data sets with many markers. For data sets not based on individual SNPs, the R^2 values will be approximate. See Section 14.4.
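For reference, the two quantities kept in Quick Mode can be computed for a pair of biallelic markers from the standard textbook definitions, as in the sketch below. The function name and inputs are assumptions for this example. The sketch returns the absolute value of D prime and does not reproduce HelixTree's CHM or EM frequency estimation.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> tuple[float, float]:
    """Return (|D'|, r^2) from the frequency of haplotype AB and of alleles A and B.

    Both markers must be polymorphic (allele frequencies strictly between 0 and 1).
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


print(ld_measures(0.5, 0.5, 0.5))   # complete association
print(ld_measures(0.25, 0.5, 0.5))  # no association
```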

7.2.3.2 Hardy Weinberg Correction when Both Markers are Biallelic

In the case when both markers are biallelic, this selection allows a modification to the CHM-based LD algorithm which corrects for deviations from Hardy-Weinberg equilibrium in the two markers. More detail on this can be found in Section 14.4 on LD Computation.

7.2.3.3 Use Patient Data Containing Missing Values

For LD computations, check this box to allow patient data that includes missing values to be included in the data to be processed. If unchecked, patients with missing values will be dropped. For more information on how these parameters are used, see Chapter 14.4’s section on LD Computation and Chapter 19 on Haplotype Frequency Estimation (EM Algorithm).

7.2.3.4 Maximum EM Iterations

To change the maximum number of iterations that will be done to obtain an EM estimate for any pair of markers (for LD) or for any marker window (for HTR), change the Maximum EM Iterations value.

7.2.3.5 EM Convergence Tolerance

To change the maximum amount by which any haplotype frequency value may change in order to consider the estimate to have “converged”, alter the EM Convergence Tolerance value.
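The roles of Maximum EM Iterations and EM Convergence Tolerance can be seen in the following minimal EM sketch for two biallelic markers. Everything here is an assumption for illustration: the genotype coding (0, 1 or 2 copies of allele 1 at each marker), the default values, the uniform starting frequencies, and the function name. It is not HelixTree's implementation, which is described in Chapter 19.

```python
def em_haplotype_frequencies(genotypes, max_iterations=50, tolerance=1e-4):
    """Estimate two-marker haplotype frequencies; return (freqs, iterations, converged)."""
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}  # uniform starting point
    n_chromosomes = 2 * len(genotypes)
    for iteration in range(1, max_iterations + 1):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # Double heterozygote: phase is ambiguous, so share the two
                # chromosomes between the 00/11 and 01/10 resolutions (E-step).
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is known: read the two haplotypes off the genotype.
                a = (1 if g1 == 2 else 0, 1 if g1 >= 1 else 0)
                b = (1 if g2 == 2 else 0, 1 if g2 >= 1 else 0)
                counts[(a[0], b[0])] += 1
                counts[(a[1], b[1])] += 1
        new_freqs = {h: c / n_chromosomes for h, c in counts.items()}  # M-step
        largest_change = max(abs(new_freqs[h] - freqs[h]) for h in haps)
        freqs = new_freqs
        if largest_change <= tolerance:
            return freqs, iteration, True
    return freqs, max_iterations, False


sample = [(0, 0), (2, 2), (1, 1), (1, 1), (0, 1), (2, 1)]  # made-up genotypes
print(em_haplotype_frequencies(sample))
```

A tighter tolerance or a lower iteration cap makes it more likely that the loop stops on the cap rather than on convergence, which is the trade-off these two settings control.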