Identity by Descent Estimation

7.11 Identity by Descent Estimation

Overview

Identity by Descent (IBD) is a measure of how many alleles at any marker in each of two individuals came from the same ancestral chromosomes. (This is in contrast to the Identity by State (IBS) measure, which is simply a measure of how many alleles at any marker in each of two individuals happen to be the same, for whatever reason.) IBD is therefore a measure of the relatedness of the pair of individuals in question. For instance:

  • The alleles of identical twins should come 100% from the same ancestral chromosomes, because they have the same chromosomes.
  • The alleles of siblings should come approximately 50% from the same ancestral chromosomes.
  • The alleles of half-siblings should come approximately 25% from the same ancestral chromosomes.
  • The alleles of unrelated individuals should not come from the same ancestral chromosomes at all, or in other words approximately 0% from the same ancestral chromosomes.

Meanwhile, it is possible for genotyped samples to exhibit apparent relatedness that has nothing to do with the relatedness or lack of relatedness of the corresponding individuals. For instance:

  • Duplicate samples will exhibit alleles coming 100% from the same chromosomes.
  • In a dual-array system such as the Affy 500K, duplicate samples from one of a pair of genotyping chips but not the other one will exhibit alleles coming 50% from the same chromosomes.
  • Sample contamination will show as one individual seeming to have relatedness to many other individuals.

SVS allows estimation of the Identity by Descent between all pairs of samples, based on the data in your genotypic spreadsheet.

  • NOTE: It is recommended that IBD estimation in SVS should be used for data quality control, rather than for actually attempting to impute relatedness among individuals whose samples you are analyzing.
  • NOTE: It is usually advisable to apply LD pruning (Quality Assurance > Genotype > LD Pruning from the spreadsheet menu) before using this feature.
  • NOTE: You will obtain the best values when you use many samples and many markers. This is due to the need to estimate allele frequencies over multiple samples, as well as the need to estimate IBD itself over multiple markers.

Data Requirements

First, import your data into an SVS project (see Importing Your Data Into A Project) to create a genotypic spreadsheet. The samples in your spreadsheet must be row-wise, and only the autosomal genotype columns should be active. (If necessary, use Select > Activate by Chromosomes from the spreadsheet menu.) To open the IBD dialog, select Quality Assurance > Genotype > Identity by Descent Estimation from the spreadsheet menu.

Values Computed

The first outputs computed are the initial computations for the respective probabilities that zero, one, or two alleles are identical by descent (shared IBD). These are designated P(Z=0), P(Z=1), and P(Z=2), respectively.

Using your genotypic data, SVS will “work backwards” to impute the most reasonable genome-wide IBD probabilities from your data, assuming it came from a homogeneous, random-mating population. For each of your markers, the allele frequencies are estimated. Using these frequencies, P(I = i|Z = z) is estimated for each combination of i, an IBS state, and z, a possible IBD state. For instance, if p and q are the actual respective allele frequencies of the two alleles in a marker, P(I = 0|Z = 0), the probability of having an IBS state of zero (completely different alleles) between two individuals given an IBD state of zero (completely different alleles by descent) between those same two individuals should be $2p^2q^2$. (This reflects both individuals having opposite homozygotes two different ways, AA and aa, or aa and AA, each with probability $p^2q^2$.) Since allele frequency estimates are made from the spreadsheet data, a correction factor is actually used to obtain unbiased estimates of P(I = i|Z = z), but the results are similar to what would otherwise be obtained.
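As a rough illustration of these conditional probabilities, the following Python sketch tabulates P(I = i|Z = z) for a single biallelic marker under the random-mating assumption. It treats p as the true allele frequency and deliberately omits the correction factor that SVS applies; the function name ibs_given_ibd is hypothetical and not part of SVS.

```python
def ibs_given_ibd(p):
    """Return table[z][i] = P(I = i | Z = z) for one biallelic marker.

    Assumes a homogeneous, random-mating population and treats p as the
    true frequency of one allele (no finite-sample correction is applied).
    """
    q = 1.0 - p
    return {
        # Z = 0: the two genotypes are independent draws from the population.
        0: {
            0: 2 * p**2 * q**2,                  # AA/aa or aa/AA
            1: 4 * p**3 * q + 4 * p * q**3,      # homozygote paired with heterozygote
            2: p**4 + 4 * p**2 * q**2 + q**4,    # identical genotypes
        },
        # Z = 1: one allele is shared by descent, so IBS 0 is impossible.
        1: {
            0: 0.0,
            1: 2 * p**2 * q + 2 * p * q**2,
            2: p**3 + p**2 * q + p * q**2 + q**3,
        },
        # Z = 2: both alleles are shared by descent, so genotypes match.
        2: {0: 0.0, 1: 0.0, 2: 1.0},
    }


table = ibs_given_ibd(0.3)
for z in (0, 1, 2):
    print(z, table[z])
```

Each row of the table sums to one, which is a quick sanity check on the formulas.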

Estimating these probabilities allows incrementing the expected count of markers with IBS state i, conditioned on IBD state z, for each pair of samples.

After all markers are scanned, a method of moments is used to find, from the expected counts and actual counts of the different IBS states, global estimates for P(Z=0), P(Z=1), and P(Z=2) for each sample pair. In some cases, these values will not be in the range of zero to one. In these cases, the values are corrected to lie in the range zero to one before they are output by SVS.
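The method-of-moments step can be sketched as follows, assuming the expected counts have already been accumulated into a table expected[z][i] (the sum over markers of P(I = i|Z = z)). The function name and the data layout are assumptions made for illustration, and the correction of out-of-range values is not shown because its details are not given here.

```python
def method_of_moments(observed, expected):
    """Estimate (P(Z=0), P(Z=1), P(Z=2)) for one pair of samples.

    observed: (S0, S1, S2), the number of markers observed in each IBS state.
    expected: expected[z][i], the expected number of markers in IBS state i
              given IBD state z, summed over all markers.
    """
    s0, s1, s2 = observed
    # IBS 0 can only arise from Z = 0, so S0 = z0 * E(I=0 | Z=0).
    z0 = s0 / expected[0][0]
    # IBS 1 arises from Z = 0 or Z = 1; remove the Z = 0 contribution first.
    z1 = (s1 - z0 * expected[0][1]) / expected[1][1]
    # IBS 2 arises from any IBD state; remove the Z = 0 and Z = 1 contributions.
    z2 = (s2 - z0 * expected[0][2] - z1 * expected[1][2]) / expected[2][2]
    return z0, z1, z2


# Made-up expected counts for 100 markers, used only to exercise the function.
expected = {
    0: {0: 10.0, 1: 30.0, 2: 60.0},
    1: {0: 0.0, 1: 40.0, 2: 60.0},
    2: {0: 0.0, 1: 0.0, 2: 100.0},
}
print(method_of_moments((2.5, 27.5, 70.0), expected))
```

With real data the raw estimates can fall slightly outside zero to one, which is why a range correction is needed before output.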

The overall fraction of alleles which are shared IBD between two individuals over the genome may be summarized by the one value

$$\pi = \frac{P(Z = 1)}{2} + P(Z = 2),$$

or half of the probability of sharing a single allele IBD plus the probability of sharing both alleles IBD.
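A minimal Python sketch of this summary value follows. The idealized IBD probabilities used for the example relationships are textbook expectations supplied here for illustration, not values produced by SVS, and the function name pi_hat is hypothetical.

```python
def pi_hat(z1, z2):
    """Overall fraction of alleles shared IBD: P(Z=1)/2 + P(Z=2)."""
    return z1 / 2 + z2


# Idealized (P(Z=0), P(Z=1), P(Z=2)) for a few relationships (assumed values).
examples = {
    "identical twins or duplicate samples": (0.0, 0.0, 1.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half-siblings": (0.5, 0.5, 0.0),
    "unrelated individuals": (1.0, 0.0, 0.0),
}

for label, (z0, z1, z2) in examples.items():
    print(f"{label}: pi = {pi_hat(z1, z2)}")
```

These idealized inputs give 1.0, 0.5, 0.25, and 0.0 respectively, matching the percentages listed in the Overview.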

The probability of sharing two alleles IBD, P(Z = 2), would be expected to be less than the square of π, that is, the probability of picking one allele shared IBD multiplied by the probability of picking a second allele shared IBD between the same two individuals. If this is not so, namely,

$$\pi^2 \le P(Z = 2),$$

a set of transformed probabilities is computed which are more biologically plausible, as follows:

$$P^*(Z = 0) = (1 - \pi)^2,$$
$$P^*(Z = 1) = 2\pi(1 - \pi),$$
and
$$P^*(Z = 2) = \pi^2.$$

Otherwise, the values labeled P* for the pair of individuals will be copied from the initial estimates (P).
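The transformation rule can be written as a short Python sketch. The function name transform is hypothetical, and the inputs are assumed to be the initial estimates after any range correction.

```python
def transform(z0, z1, z2):
    """Return (P*(Z=0), P*(Z=1), P*(Z=2)) from the initial estimates."""
    pi = z1 / 2 + z2
    if pi * pi <= z2:
        # P(Z=2) is at least pi squared: replace all three values with
        # the binomial-style probabilities implied by pi.
        return (1 - pi) ** 2, 2 * pi * (1 - pi), pi * pi
    # Otherwise the initial estimates are copied unchanged.
    return z0, z1, z2


print(transform(0.6, 0.0, 0.4))  # condition holds, values are transformed
print(transform(0.3, 0.6, 0.1))  # condition fails, values are copied
```

Note that π itself is unchanged by the transformation, since P*(Z = 1)/2 + P*(Z = 2) = π(1 − π) + π² = π.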

The complete algorithm used by SVS is spelled out in [Purcell 2007].

Using IBD Estimation

Select the computation parameter (if applicable) and the output options, then select the Run button to process. The computation parameter and the output options are described below.

One or more spreadsheets of results will be created as children of the current spreadsheet navigator window node. Information about the parameters used will be recorded in the Node Change Log.

Parameters

Allele Counts

If your spreadsheet is a pedigree spreadsheet, you may check Use only founders for allele counts (default for pedigree spreadsheets) to count alleles only from samples which contain missing values for the Father ID and the Mother ID. On the other hand, you may leave this box unchecked to count alleles from all samples to determine allele frequencies, which is what is done for a non-pedigree spreadsheet.
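The effect of this option can be illustrated with a small Python sketch. The sample layout (dictionaries with father, mother, and two-character genotype strings) and the function name allele_counts are assumptions made for this example and do not reflect how SVS stores data internally.

```python
from collections import Counter


def allele_counts(samples, marker, founders_only=True):
    """Count alleles at one marker, optionally using founders only.

    A founder is taken to be a sample whose Father ID and Mother ID
    are both missing (None in this sketch).
    """
    counts = Counter()
    for sample in samples:
        is_founder = sample["father"] is None and sample["mother"] is None
        if founders_only and not is_founder:
            continue
        genotype = sample["genotypes"].get(marker)
        if genotype is None:
            continue  # skip missing genotypes
        counts.update(genotype)  # a string such as "AB" adds one A and one B
    return counts


samples = [
    {"id": "dad", "father": None, "mother": None, "genotypes": {"m1": "AA"}},
    {"id": "mom", "father": None, "mother": None, "genotypes": {"m1": "AB"}},
    {"id": "kid", "father": "dad", "mother": "mom", "genotypes": {"m1": "AB"}},
]

print(allele_counts(samples, "m1", founders_only=True))
print(allele_counts(samples, "m1", founders_only=False))
```

Restricting the counts to founders avoids counting the same ancestral alleles more than once through related offspring.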

Identity by Descent Estimation Outputs

The following outputs may be checked or unchecked:

  • Output untransformed estimates of P(Z=0), P(Z=1), and P(Z=2) (three spreadsheets) (selected by default)
  • Output PI = P(Z=1)/2 + P(Z=2) (one spreadsheet) (selected by default)
  • Output transformed estimates P*(Z=0), P*(Z=1), and P*(Z=2) (three spreadsheets)

All of these outputs are in the form of a spreadsheet with both rows and columns corresponding to the samples, with each cell representing the IBD value between the two samples represented by its row and its column.

The reason these outputs are in this form is to allow you to view them using the heat map feature of SVS. You may then easily pick out any pair of duplicate samples, or one sample contaminating a number of other samples, or other suspicious values of estimated IBD.

To view a spreadsheet as a heat map, select Plot > Heat Map from the spreadsheet menu.

Additional Outputs

To get a listing of all pairs of samples whose IBD PI estimate is at or above a certain value, check Output all pairs where PI >= (value), and input the value to use.

This listing will output, in one spreadsheet, one row for every pair of samples meeting the criterion above. The sample pair will be output along with all of the pair’s IBD values.
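For readers who export the PI spreadsheet and want to reproduce this kind of listing outside SVS, the following Python sketch scans a symmetric sample-by-sample matrix and reports each pair at or above a threshold. The function name, the nested-list matrix layout, and the sample values are assumptions for illustration; unlike the SVS listing, this sketch reports only PI and not the other IBD values.

```python
def pairs_at_or_above(sample_ids, pi_matrix, threshold):
    """Return (sample_a, sample_b, pi) for every pair with pi >= threshold.

    pi_matrix[i][j] holds the PI estimate between sample i and sample j.
    Only the upper triangle is scanned, so each pair is reported once.
    """
    pairs = []
    n = len(sample_ids)
    for i in range(n):
        for j in range(i + 1, n):
            value = pi_matrix[i][j]
            if value >= threshold:
                pairs.append((sample_ids[i], sample_ids[j], value))
    return pairs


ids = ["S1", "S2", "S3"]
pi_matrix = [
    [1.00, 0.98, 0.02],
    [0.98, 1.00, 0.26],
    [0.02, 0.26, 1.00],
]

for a, b, value in pairs_at_or_above(ids, pi_matrix, 0.25):
    print(a, b, value)
```

In this made-up example, S1 and S2 would be flagged as a likely duplicate pair, and S2 and S3 as possibly related.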