The Genetics Menu

The Genetics spreadsheet menu contains items for running genetic tests and plots, as well as other actions that pertain to spreadsheets containing genetic data.


[Picture]
Figure 6.23: The Genetics menu showing various options

6.6.1 Genetic Association Tests


[Picture]
Figure 6.24: The Association Tests Window

This item opens an options dialog for multiple types of genetic association tests. The dialog can only be accessed if the spreadsheet contains one dependent column which is either case/control (binary) or quantitative (integer or real). For details about running genetic association tests, see Chapter 18, Genetic Association Tests, and 18.8 in particular.

NOTE: If you have an integer column which only contains the values 0 or 1, along with missing values, the Genetic Association Test dialog will treat this column as a binary (case/control) column. This allows you to have missing values in your binary data when you import it into the spreadsheet and still analyze it as binary data.
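The rule in the note above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not HelixTree's actual code: the function name is_binary_column is hypothetical, missing values are represented here as None, and the requirement that at least one non-missing value be present is an assumption.

```python
from typing import Optional, Sequence


def is_binary_column(values: Sequence[Optional[int]]) -> bool:
    """Return True if an integer column can be treated as case/control.

    Missing values (None in this sketch) are ignored. Every remaining
    value must be 0 or 1. Requiring at least one non-missing value is an
    assumption made for this illustration.
    """
    observed = [v for v in values if v is not None]
    return bool(observed) and all(v in (0, 1) for v in observed)


# A 0/1 column with gaps is still treated as binary.
print(is_binary_column([0, 1, None, 1]))
# Any other integer value makes the column quantitative instead.
print(is_binary_column([0, 2, None]))
```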

NOTE: The results of these tests are output to another spreadsheet. To see the output variables as p-value plots, go to that spreadsheet and either plot the individual columns of interest by clicking on their headers (6.2.5), or use Analysis->Plot Numeric Columns to plot all of the numeric output variables in one plot window (6.5.5).

6.6.2 Marker Statistics


[Picture]
Figure 6.25: The General Marker Statistics Window

You can perform some general marker statistics on genetic columns with this option. These statistics are also available in the third tab of the Genetic Association Tests window (see 18.10). Please see 18.7 for an explanation of the options for this window.

You do not need to set a dependent variable for this analysis, unless (optionally) you have a case/control variable and wish to have certain marker statistics broken down by “cases” and “controls”.

6.6.3 Principal Component Analysis


[Picture]
Figure 6.26: The Principal Component Analysis Window

You can perform Principal Component Analysis on genetic columns with this option. These statistics are also available in the second tab of the Genetic Association Tests window (see 18.8.3). Please see 18.6.1 for an explanation of the options for this window.

You do not need to set a dependent variable for this analysis.

6.6.4 EM Haplotype Frequency Estimation

This viewer shows the EM Haplotype Frequency Estimation of the markers for the currently selected subset of the spreadsheet. It is covered in detail in Chapter 19, EM Haplotype Frequency Estimation.

6.6.5 HWE Plot

This viewer shows the Hardy Weinberg Equilibrium Plot of the markers for the currently selected subset of the spreadsheet.

If the spreadsheet is marker mapped, there will be two options for the HWE Plot, Uniform Scaled and Position Scaled. In the Uniform Scaled plot, the markers will be evenly spaced along the horizontal axis. In the Position Scaled plot, the markers will be spaced relative to their chromosome positions.
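The difference between the two scalings can be sketched as a computation of horizontal coordinates. This is a minimal illustration only: the function names are hypothetical, and rescaling the positions to the same overall width as the uniform plot is an assumption made so the two results are easy to compare.

```python
from typing import List, Sequence


def uniform_scaled(n_markers: int) -> List[float]:
    """Evenly spaced x coordinates: marker i is drawn at x = i."""
    return [float(i) for i in range(n_markers)]


def position_scaled(positions: Sequence[int]) -> List[float]:
    """X coordinates proportional to chromosome position.

    Positions are rescaled to the range 0..n-1 (an assumption of this
    sketch) so the plot has the same width as the uniform version.
    """
    lo, hi = min(positions), max(positions)
    span = hi - lo
    if span == 0:
        return [0.0] * len(positions)
    width = len(positions) - 1
    return [(p - lo) / span * width for p in positions]


# Four markers: two close together at each end of a region.
positions = [100, 200, 900, 1000]
print(uniform_scaled(len(positions)))
print(position_scaled(positions))
```

In the position-scaled result, the two marker pairs cluster at the ends of the axis, reflecting the large gap between positions 200 and 900.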

The Hardy Weinberg Equilibrium Plot is covered in Chapter 15.1.1.

6.6.6 LD Plot

This viewer shows the Linkage Disequilibrium Plot of the markers for the currently selected subset of the spreadsheet.

If the spreadsheet is marker mapped, there will be two options for the LD Plot, Uniform Scaled and Position Scaled. In the Uniform Scaled plot, the markers will be evenly spaced along the horizontal axis. In the Position Scaled plot, the markers will be spaced relative to their chromosome positions.

See Chapter 14.1 Plotting Linkage Disequilibrium.

6.6.7 PBAT Family-Based Analysis

This menu item will be available from any Pedigree spreadsheet (a spreadsheet that has been imported from a pedigree file). To perform a PBAT family-based analysis:

  • Activate the columns for all of the markers you wish to analyze. Deactivate all of the other genetic columns. (Hint: You may wish to use Edit->Column->Inactivate All first, then activate the markers you wish to analyze.)
  • Select this menu item. A dialog will come up, allowing you to select your PBAT options.

See Chapter 23 for details on PBAT family-based analysis.

6.6.8 Runs of Homozygosity

This menu item is available if the spreadsheet is marker mapped. It calculates Runs of Homozygosity for the currently selected subset of the spreadsheet. It is covered in detail in Chapter 21, Runs of Homozygosity.

6.6.9 Infer Missing Genotypes

Choose this to form a new spreadsheet in which as many missing genotype values as possible are filled in with inferences made using the missing-value extension to the Expectation-Maximization (EM) algorithm. The parameters that HelixTree will ask for, and the recommended values for them, are as follows:

  • Window Size for Finding Markers with Highest LD (Perhaps 100. Using a marker mapped spreadsheet should make finding more markers with higher linkage disequilibrium (LD) easier.)
  • Number of Markers for the EM Algorithm (Five to ten.)
  • Minimum Acceptable Diplotype Probability for an Inferred Value (Use .99 to ensure that a higher percentage of inferred values are correct. Use a lower number, .5 or zero, to maximize how many inferences are made.)
  • Maximum EM Iterations (20 to 100. More for a larger number of markers.)
  • EM Convergence Tolerance (.001 to .0001. Smaller for more markers.)

The prevailing project options for computing LD will be used. (If you limit the window size down to the number of markers, the finding-best-LD step will be skipped.)

If you have deselected any rows or columns, a message window comes up before the above parameter selection. It allows you to choose whether to infer missing values in the whole spreadsheet or only in the rows and columns of the spreadsheet which have been selected or made dependent. (Non-selected columns, but not non-selected rows, will be used in the inference process.)

Another message window may remind you that you may get better results using a mapped spreadsheet.

When the new spreadsheet is created, statistics (how many missing values were inferred out of how many total missing values), along with all of the parameters that were used to create the new spreadsheet, are written to the new spreadsheet’s annotation window.

This works as follows. After the markers with the best LD to the given marker are found (if applicable), HelixTree’s EM capability determines the haplotypes over these markers, always using patient data which has missing values.

Then, for each patient with a missing value in the given marker, the values for that genotype implied by the patient’s diplotypes are examined. The probability of each potential genotype is inferred to be the same as the probability of the diplotype containing it (or to be the sum of the probabilities of all diplotypes containing it). The potential genotype with the highest probability is tentatively selected, and if that probability is at least the “Minimum Acceptable Diplotype Probability for an Inferred Value”, the patient’s missing value is inferred to be that genotype.
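The selection step described above can be sketched as follows. This is an illustration under stated assumptions, not HelixTree's implementation: the function name infer_genotype, the representation of a diplotype as a pair of haplotype strings with one allele character per marker, and the example probabilities are all hypothetical. The diplotype probabilities are taken as given (in HelixTree they would come from the EM step).

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Hypothetical representation: a diplotype is a pair of haplotype strings,
# each holding one allele character per marker in the window.
Diplotype = Tuple[str, str]
Genotype = Tuple[str, str]


def infer_genotype(
    diplotype_probs: Dict[Diplotype, float],
    marker_index: int,
    min_prob: float,
) -> Optional[Genotype]:
    """Pick the most probable genotype at one marker, or None.

    The probability of each candidate genotype is the sum of the
    probabilities of all diplotypes that imply it. The best candidate is
    accepted only if its probability is at least min_prob.
    """
    genotype_probs: Dict[Genotype, float] = defaultdict(float)
    for (hap_a, hap_b), prob in diplotype_probs.items():
        # Sort the two alleles so that A/G and G/A count as one genotype.
        a, b = sorted((hap_a[marker_index], hap_b[marker_index]))
        genotype_probs[(a, b)] += prob
    if not genotype_probs:
        return None
    best, best_prob = max(genotype_probs.items(), key=lambda kv: kv[1])
    return best if best_prob >= min_prob else None


# Made-up diplotype probabilities for one patient over two markers.
probs = {("AC", "AT"): 0.6, ("AC", "GT"): 0.3, ("GC", "GT"): 0.1}
print(infer_genotype(probs, marker_index=0, min_prob=0.5))
print(infer_genotype(probs, marker_index=0, min_prob=0.99))
```

With the lower threshold the missing value at the first marker is filled in with the A/A genotype (probability 0.6). With the stricter threshold no candidate qualifies, so the value stays missing.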

Please see Chapter 19 and [Chiano 1998] for more details.