Composite Haplotype Method (CHM)

A common type of genetic data is genetic information independently scored at several markers along a chromosome. Each subject has two copies of the same chromosome, one chromosome from the mother, the other one from the father. It is not clear which alleles at different markers reside on the maternal, and which are on the paternal copy of the chromosome. Only individuals that are heterozygous at most at a single marker can be resolved into a pair of haplotypes unambiguously. This problem, called “genetic phase uncertainty” is an example of a more general problem of statistical inference in the presence of missing data. The expectation-maximization (EM) algorithm, formalized by [Dempster 1977] is a popular iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies (see Excoffier and Slatkin, 1995 for details of obtaining haplotype frequencies by EM [Excoffier 1995]).

Consider the following simple example. Suppose we observe a sample of 6 individuals typed at two bi-allelic markers. The first three individuals are double homozygotes, AA/BB, contributing six gametes of the type AB to the sample counts. The next two are aa/bb, contributing four gametes of the type ab. The last individual, however, is a double heterozygote, Aa/Bb, and there are two possible pairs of gametes, AB with ab, or Ab with aB. Taking into account the first part of the sample, the EM algorithm assigns an extremely low probability to the second, (Ab with aB) arrangement, because these types of gametes have not been observed among unambiguous individuals at all. To make such a calculation, it is necessary to assume “Hardy Weinberg Equilibrium” (HWE) over the two markers, which means that the two-marker genotypes do not deviate from the products of the haplotype frequencies. On the other hand, the CHM method described below does not assume HWE and maintains that either of the two arrangements (AB with ab, or Ab with aB) is equally likely.

Certainly, there will be discrepancy between sample frequencies generated by the two methods. Our goal, however, is not simply to report, but to contrast haplotype frequencies between levels of phenotype. Assume for the moment two markers with two alleles and that the phenotype is binary, i.e. the subjects can be classified as either “cases” or “controls”. For simplicity, assume that the haplotype AB determines the probability of an individual being a “case” and individual alleles are unimportant. Then the expected allele frequencies (pA,pB) are the same among cases and controls. We can therefore express the expected frequency of AB haplotypes in cases and controls as

Cases :    pAB = pApB +D1
Controls : pAB = pApB +D2


where D1, D2 are linkage disequilibrium (LD) coefficients. These coefficients are needed to account for the part of gametic frequency not explained by frequencies of alleles, pA,pB. When we are comparing pAB between cases and controls, we are essentially looking at the size of the difference between D1 and D2. The EM algorithm attempts to obtain estimates of pAB in cases and controls separately, assuming HWE in these groups. However, even if HWE holds in the population, it is expected not to hold in either cases or controls looked at separately [Nielson 1998], which may lower the power of the test. Moreover, cases and controls exhibit the deviation from HWE in different directions [Zaykin 2000], which is exploited by CHM. The CHM estimates a quantity different from pAB, specifically

Cases :   ϕAB = pApB + D1 + pA∕B
Controls : ϕAB = pApB + D2 + pA∕B


The last term, pA∕B, is the frequency of A, B alleles that reside on two different gametes (Weir, 1996), in contrast to pAB, that measures their joint frequency on the same gamete. This “intra-gametic” frequency can also be written as a product of A, B allele frequencies plus the deviation unexplained by the product. Generally, this deviation is not zero if the HWE at the haplotype level does not hold. It is also expected that this deviation is generally larger in cases than in controls (denote the difference by δ), when cases are less prevalent category in a population. Therefore, when comparing composite frequencies ϕAB between cases and controls, we are looking at the size of the difference D1 - D2 + δ. The term δ is zero only under the multiplicative model of haplotype action, in which case two methods are comparing the same quantities.

In practice, we have found that more often than not the EM and CHM result in essentially the same p-values for tests of association with the phenotype, although the CHM can be much more powerful under models that include large non-additive components. Compared to EM, it is a non-iterative and much faster algorithm, posing no danger of convergence to a local maximum.

26.7.1 Technical details

The CHM is based on the idea of the genotypic LD coefficient, ΔAB, [Weir 1996]. Estimation of ΔAB involves calculation of di-genic frequencies. In the two-locus bi-allelic case, they are estimated as

1η   =  2nAABB--+-nAABb-+-nAaBB-+nAaBb-∕2,
n AB                   n

where nAaBb, for example, is the number of individuals with genotype Aa/Bb, and n is the sample size. The composite disequilibrium is defined as a sum of inter- and intra-gametic components,

ΔAB  = DAB  +DA ∕B = PAB + PA∕B - 2pApB.

Under random mating, PA∕B = pApB, and so ΔAB is an unbiased estimate of the LD parameter, DAB. For our purposes it is not necessary to separate the inter- and intra-gametic components and we work in terms of

ρAB =  PAB-+-PA∕B.
           2

[Zaykin 2001] extended the definition of di-genic frequencies to multiple loci and alleles. For the i-th individual multilocus genotype gi, let H(gi) be the number of single—locus heterozygotes in gi. Define weights as

w(gi) =-H(1g)-1 = 21- H(gi).
       2   i

Sample composite haplotype counts are calculated from summing over individual contributions,

       ∑n
ηabc,...=    w (gi)I(a,b,c,...⊂ gi),
       i=1

where n is the sample size, and I() is the indicator function, defined as

I(a,b,c,...⊂  g) = { 1 ifi- th individualgenotypegi hasallelesa,b,c,...
            i     0  otherwise

Thus, if the i-th individual has at least one copy of all required alleles, it is counted with weight w(gi). The composite haplotype frequencies are given by

ρabc...= -1ηabc....
       2n

In a two-locus, two-allele case, composite haplotype counts simplify to Weir’s definition, ηAB = 2nAABB+nAABb+nAaBB+nAaBb2. In a single-locus case, they are the usual definition of the allele count:

ni = 2nii + ∑ nij.
          i⁄=j

26.7.2 Haplotype Trend Regression (HTR)

The motivation for HTR is to provide a unified approach for testing association of haplotype frequencies with a discrete or a continuous phenotype. The HTR, developed by [Zaykin 2001], fits a model of additive effects of haplotypes. Under the hypothesis of no haplotype effects and under the assumption of Hardy-Weinberg Equilibrium (HWE) the HTR is closely related to the previously suggested likelihood ratio test (LRT) for the binary phenotype, described in [Xie 1993Fallin 2000Zhao 2000]. The difference occurs upon the situation of interest, when haplotypes have an actual effect on the phenotype, and the assumption of HWE in a particular category of response is no longer valid [?]. In this situation the HTR provides a more powerful test [Zaykin 2001]. This is the consequence of the fact that the LRT operates on haplotype counts that add up to twice the sample size, which implicitly assumes HWE.

To explain the HTR method simply, consider a simple example of three individuals with response values denoted as Y1 , Y2 , and Y3 that have unambiguously resolved haplotype pairs h1/h1, h2/h3, and h1/h3. A table of haplotype frequencies vs. individuals would then look like:

      h1   h2   h3

Y1     1   0    0
Y2     0  1∕2  1∕2
Y3    1∕2  0   1∕2

and the linear regression equation E(Y) = Dβ would then become

                                 (    )
  (  Y1 )   ( 1   1    0    0  ) |  μ |
E (  Y2 ) = ( 1   0   1∕2  1∕2 ) | β1 |
     Y3       1  1∕2   0   1∕2   ( β2 )
                                   β3

where the β1, β2, and β3 are coefficients corresponding with haplotypes h1, h2, and h3.

In the case where haplotypes are statistically inferred, the entries in the matrix D will be the inferred conditional probabilities of haplotypes given the genotype. Specifically, for haplotypes h2 and h3, the conditional frequency (h2, h3) for the i-th individual with genotype Gi is

                 Pr(Gi|h2,h3)ph ph
P r(h2,h3|Gi ) = ∑--------------2-3---
                u,v Pr(Gi|hu,hv)phuphv

wherephu and phv denote haplotype frequencies, and each haplotype in a pair is equally likely, i.e.

                       1
P r(hu|Gi) = P r(hv|Gi) = 2Pr(hu,hv|Gi ).

These probabilities correspond to columns 2 to 3 in the matrix D above. The probability Pr(Gi|hk,hj) is either zero or one, and so the D values for haplotypes incompatible with the i-th subject genotype are equal to 0, and they are equal to 12 or to 1 otherwise.

In the case when haplotype frequencies phkphj are statistically inferred, the entries in D corresponding to haplotypes are no longer 0, 12, and 1, except for homozygous and single heterozygous subjects, reflecting the haplotype assignment ambiguity (see “Composite Haplotypes Method” section) 26.7. The test for association of haplotype with trait is the test of H0 : β1 = β2 = β3 = 0, using the chi-square statistic from fitting a logistic regression in the case of the binary phenotype, or the F statistic from fitting a linear regression for the case of the continuous phenotype. The p-value for the continuous phenotype (linear regression) with n observations is obtained as follows:

        ∑
redss = (-ni=1yi)2,
tcss = ∑n  ny2- redss,
mss = ∑ni=1yiβTx , whereβ isthe regression matrixand x are thecoefficients,
regss = mia=x1(imss i- redss,0),                     i
errss = tcss- regss,
reg_df = k - 1, wherek isthenumber of haplotypes,
err_df = n - reg_df - 1, and
              (regss×err_df            )
p = aP = Ftest errss×reg_df,reg_df,err_df  .

(Except for those circumstances where mss is less than redss, this is equivalent to the F-test for linear regression

              (    2                       )
p = aP = Ftest  --R--×2-err_df---,reg_df,err_df  ,
                (1 - R )× reg_df

where R2 is the coefficient of determination for the linear regression.)

In the simulated example below we use the included data file “data_312.ghd”. The goal is to see if a gene can be located by testing for association with a set of markers that surround the gene. The 200 markers are single nucleotide polymorphisms, each being one of homozygous 0/0, 1/1 or heterozygous 0/1. If we use regular recursive partitioning, we can subset the population by each gene one of four ways: {{0/0}{0/1} {1/1}}, {{0/0, 0/1} {1/1}}, {{0/0, 1/1} {0/1}}, {{0/0} {0/1, 1/1}}. We see below at left a log plot of the p-value of the optimal split for each marker versus marker location. We see that the signal around marker 22 cannot be pulled out by usual recursive partitioning methods. However, when we perform haplotype trend regression, with a moving window of 5 genetic markers (and estimating the haplotype frequencies using the CHM method), we see in the plot at right that the signal around marker 27 is highly significant, with p< 10-14. We see that single marker effects do not carry enough information to describe the underlying genetic variation, but that HTR has the power to do so.


[Picture]
Figure 26.1: Plotting P value with recursive partitioning.


[Picture]
Figure 26.2: Plotting P value with Haplotype Trend Regression

26.7.2.1 Left-Out Haplotypes

The set of all possible haplotypes for a given set of markers essentially forms a number of categories. Whether a given chromosome of any given patient has a haplotype may be thought of as a dummy variable being one or zero for the “category” corresponding to that haplotype. Therefore, just as one category is left out of the assignment to dummy variables when a typical categorical variable undergoes regression, HelixTree leaves out at least one of the possible haplotypes from a haplotype trend regression, since the value of any haplotype’s frequency is determined by the frequency values of all of the other haplotypes.

Even though one patient may have up to two different haplotypes, and even though HelixTree assesses frequencies for the various haplotypes, HelixTree still leaves out at least one of the haplotypes from the analysis, because there is a co-dependency of any haplotype’s frequency on that of all of the others.

26.7.2.2 Rare Haplotypes

If no haplotype is determined to be “rare”, that is, all haplotypes have been determined to have a frequency of at least the Minimum haplotype frequency that may be specified in the options, then the last haplotype (alphabetically sorted) is (arbitrarily) left out.

However, if one haplotype is determined to be “rare”, that is, exactly one haplotype has been determined to have a frequency of less than the Minimum haplotype frequency that may be specified in the options, that haplotype will be left out of the regression analysis. This is because the frequency estimated for that haplotype is presumably less reliable than the frequencies estimated for the other haplotypes.

If more than one haplotype is determined to be “rare”, then all such haplotypes will be lumped into one category, called “Rare Haps”, and that category will be left out of the regression analysis. This is for the same reason–namely, that the frequency estimated for that category is presumably less reliable than the frequencies estimated for the other categories (haplotypes).