Inferring Missing Genotype Data
HelixTree can infer a large percentage of the missing SNP genotypes in a spreadsheet using an extension of the Expectation-Maximization (EM) algorithm that accommodates patient data containing missing values. Because neighboring markers are correlated through linkage disequilibrium (LD), missing values can be inferred from them. For each marker with missing values, we:
- Select the k markers in highest LD with the marker of interest, within a window of w markers centered on it.
- Estimate k-marker haplotypes, imputing missing values (Chiano M. N., Clayton D. G. (1998) Ann. Hum. Genet. 62, 55-60).
- If the genotype can be assigned with probability above some specified threshold, assign it.
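
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not HelixTree's actual implementation: genotypes are assumed to be coded as 0, 1, or 2 copies of one allele with -1 for missing, LD is measured as squared genotype correlation, the haplotype is taken over the marker of interest plus its k selected neighbors, and all function names and default parameter values (`impute_marker`, `n_iter`, `threshold=0.95`, and so on) are hypothetical.

```python
import itertools
import numpy as np

MISSING = -1  # assumed coding: 0/1/2 allele copies, -1 for a missing call


def r_squared(a, b):
    """Squared correlation of two genotype columns over samples observed at both."""
    ok = (a != MISSING) & (b != MISSING)
    if ok.sum() < 3 or a[ok].std() == 0 or b[ok].std() == 0:
        return 0.0
    return float(np.corrcoef(a[ok], b[ok])[0, 1] ** 2)


def select_neighbors(G, j, k, w):
    """Step 1: the k markers with highest r^2 to marker j, searched within
    w // 2 markers on either side of j (an approximation of 'a window of w')."""
    lo, hi = max(0, j - w // 2), min(G.shape[1], j + w // 2 + 1)
    cands = [m for m in range(lo, hi) if m != j]
    cands.sort(key=lambda m: r_squared(G[:, j], G[:, m]), reverse=True)
    return cands[:k]


def compatible_pairs(geno, haps):
    """Ordered haplotype pairs consistent with one genotype row.
    A missing entry is compatible with any allele count."""
    pairs = []
    for a, b in itertools.product(range(len(haps)), repeat=2):
        if np.all((geno == MISSING) | (geno == haps[a] + haps[b])):
            pairs.append((a, b))
    return pairs


def em_haplotype_freqs(sub, n_iter=50):
    """Step 2: EM estimate of haplotype frequencies from unphased genotypes,
    assuming random pairing of haplotypes (Hardy-Weinberg)."""
    haps = np.array(list(itertools.product([0, 1], repeat=sub.shape[1])))
    pairs = [compatible_pairs(row, haps) for row in sub]
    f = np.full(len(haps), 1.0 / len(haps))
    for _ in range(n_iter):
        counts = np.zeros(len(haps))
        for pr in pairs:
            # E-step: posterior weight of each compatible pair for this sample
            wts = np.array([f[a] * f[b] for a, b in pr])
            wts /= wts.sum()
            for (a, b), wt in zip(pr, wts):
                counts[a] += wt
                counts[b] += wt
        # M-step: renormalize expected haplotype counts
        f = counts / counts.sum()
    return haps, f, pairs


def impute_marker(G, j, k=2, w=10, threshold=0.95):
    """Step 3: fill a missing genotype at marker j only when its posterior
    probability exceeds the threshold; otherwise leave it missing."""
    cols = [j] + select_neighbors(G, j, k, w)  # column 0 is the target marker
    haps, f, pairs = em_haplotype_freqs(G[:, cols])
    out = G[:, j].copy()
    for i in np.where(G[:, j] == MISSING)[0]:
        post = np.zeros(3)  # posterior over target genotypes 0, 1, 2
        for a, b in pairs[i]:
            post[haps[a, 0] + haps[b, 0]] += f[a] * f[b]
        post /= post.sum()
        if post.max() > threshold:
            out[i] = int(post.argmax())
    return out
```

Calling `impute_marker(G, j)` on a samples-by-markers integer array returns a copy of column `j` with confidently inferred genotypes filled in. Because every pair of the 2^(k+1) possible haplotypes is enumerated for each sample, this sketch is only practical for small k; raising `threshold` trades recovered genotypes for a lower false-assignment rate.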
We tested this feature on an Affymetrix 500k dataset using very stringent criteria for making calls. With a runtime of about 25 minutes, we recovered approximately 18% of the missing genotypes with a false-assignment probability below 0.0025, which was lower than the experimental error.

Figure. Inferring missing genotype data dialog box.





