I recently gave a webcast on GWAS in a model organism, Arabidopsis thaliana, and a question came up about the differences between EMMA and EMMAX and why the results from each would differ.
Both algorithms are used in association studies to account for population stratification and relatedness in the data. However, they differ slightly in how they are run. EMMA works well for datasets with a small to moderate n, but the computation required for a large n quickly becomes impractical. This is because EMMA estimates the variance components for the reduced model plus the marker being tested, so each marker gets its own variance component calculation. EMMAX, the expedited version of the algorithm, assumes that the effect of each SNP on the trait is small. It therefore estimates the variance components only once, for the reduced model, and reuses them for every marker in the run. This simplification is why the two can give different results on some datasets. For example, when an associated SNP has a large effect on the phenotype, the EMMAX assumption is violated, which may yield conservative p-values and a loss of statistical power.
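To make the "one calculation per marker" versus "one calculation per run" distinction concrete, here is a minimal Python/NumPy sketch. It is not the EMMA or EMMAX implementation. It assumes simulated genotypes, a kinship matrix built from standardized genotypes, and a simple maximum-likelihood grid search over the variance ratio in place of the REML optimization the real tools use. The function names (`fit_gls`, `best_delta`, `scan`) are hypothetical and exist only for illustration.

```python
import numpy as np

def fit_gls(s, y_r, X_r, delta):
    """GLS fit in the eigen-rotated space for a fixed delta = sigma_e^2 / sigma_g^2.
    After rotation the covariance is sigma_g^2 * diag(s + delta)."""
    w = 1.0 / (s + delta)
    XtWX = X_r.T @ (w[:, None] * X_r)
    beta = np.linalg.solve(XtWX, X_r.T @ (w * y_r))
    resid = y_r - X_r @ beta
    n = y_r.shape[0]
    sigma_g2 = np.sum(w * resid ** 2) / n  # ML estimate given delta
    loglik = -0.5 * (n * np.log(2 * np.pi * sigma_g2) + np.sum(np.log(s + delta)) + n)
    return beta, sigma_g2, XtWX, loglik

def best_delta(s, y_r, X_r, grid):
    """One 'variance component calculation': pick delta by grid search (assumption)."""
    logliks = [fit_gls(s, y_r, X_r, d)[3] for d in grid]
    return grid[int(np.argmax(logliks))]

def scan(K, y, G, per_marker):
    """Wald chi-square statistic per marker, plus the number of variance fits.
    per_marker=True mimics the EMMA-style scan, False the EMMAX-style scan."""
    n, m = G.shape
    s, U = np.linalg.eigh(K)          # K = U diag(s) U^T
    y_r, G_r = U.T @ y, U.T @ G       # rotate phenotype and genotypes
    ones_r = U.T @ np.ones(n)         # rotated intercept column
    grid = np.logspace(-3, 3, 61)
    n_fits = 0
    if not per_marker:
        # EMMAX-style: estimate once under the reduced (intercept-only) model
        delta_null = best_delta(s, y_r, ones_r[:, None], grid)
        n_fits = 1
    stats = np.empty(m)
    for j in range(m):
        X_r = np.column_stack([ones_r, G_r[:, j]])
        if per_marker:
            # EMMA-style: re-estimate with the tested marker in the model
            delta = best_delta(s, y_r, X_r, grid)
            n_fits += 1
        else:
            delta = delta_null
        beta, sigma_g2, XtWX, _ = fit_gls(s, y_r, X_r, delta)
        var_beta = sigma_g2 * np.linalg.inv(XtWX)[-1, -1]
        stats[j] = beta[-1] ** 2 / var_beta
    return stats, n_fits

# Simulated data (assumption): 60 samples, 200 SNPs, marker 0 has a large effect.
rng = np.random.default_rng(0)
n, m = 60, 200
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)
K = Z @ Z.T / m
y = 0.8 * Z[:, 0] + Z @ rng.normal(0, np.sqrt(0.5 / m), m) + rng.normal(size=n)

emmax_stats, emmax_fits = scan(K, y, G, per_marker=False)
emma_stats, emma_fits = scan(K, y, G, per_marker=True)
print("variance fits (EMMAX-style, EMMA-style):", emmax_fits, emma_fits)
print("marker 0 statistic (EMMAX-style, EMMA-style):", emmax_stats[0], emma_stats[0])
```

The fit counter shows where the cost difference comes from: the EMMAX-style scan does a single variance fit, while the EMMA-style scan does one per marker. Comparing the statistics for marker 0 shows how the two approaches can diverge for a large-effect SNP, though how much they diverge will depend on the data.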
In conclusion, EMMA and EMMAX differ in that EMMA performs its own variance component calculation for each marker, whereas EMMAX assumes each SNP effect is small and therefore performs a single variance component calculation for the whole run.
Ref: Zhou and Stephens 2012. “Genome-wide Efficient Mixed Model Analysis for Association Studies.” Nature Genetics 44(7): 821-824.
Thanks – then does it make sense to run EMMA post hoc on large effect size variants, perhaps by running a traditional PLINK PCA-corrected association model to select them, then running EMMAX on all variants and EMMA on the high effect size ones?