Workflow for Reading Affymetrix CEL Files
CNAM in conjunction with HelixTree provides a complete set of workflows for whole-genome copy number association: it starts with Affymetrix Mapping Array CEL files, creates normalized log2 ratios, finds regions of copy number variation, and performs association testing on the copy number variants.
The workflow to generate normalized log2 ratios closely follows the methodology employed by Affymetrix [Affymetrix 2007], scaled to handle thousands of cases and controls. The steps are as follows:
- Depending on the Mapping array type, anywhere from 1 to 40 probes are used to interrogate a given marker. The Affymetrix 500k mapping array provides perfect match and mismatch probes, whereas the Affy 5.0 and 6.0 chips only include perfect match probes. We use only the means of the perfect match probes.
- For Affy 500k, the NSP and STY A and B probe intensities are extracted separately. For a given marker (SNP or CNV) there may be anywhere from 1 to 40 probes. These are averaged to get a probe intensity per marker.
- For Affy 5.0 and 6.0, the polymorphic SNP probes and the non-polymorphic copy number probes are extracted separately. The non-polymorphic probes only have an A intensity.
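The per-marker averaging described above can be illustrated with a minimal sketch. The record layout (marker ID, allele, intensity) and the marker names are assumptions made for illustration only; they are not the CEL file format or the CNAM internals.

```python
from collections import defaultdict


def mean_intensity_per_marker(probe_records):
    """Average perfect-match probe intensities per (marker, allele).

    probe_records is a hypothetical iterable of
    (marker_id, allele, intensity) tuples, one per probe.
    """
    grouped = defaultdict(list)
    for marker_id, allele, intensity in probe_records:
        grouped[(marker_id, allele)].append(intensity)
    # One mean intensity per marker and allele.
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Made-up example: one SNP marker with A and B probes, and one
# non-polymorphic copy number marker that only has an A intensity.
records = [
    ("SNP_1", "A", 100.0),
    ("SNP_1", "A", 140.0),
    ("SNP_1", "B", 60.0),
    ("CN_1", "A", 500.0),
]
means = mean_intensity_per_marker(records)
```

Keeping the allele in the key means the A and B intensities of a polymorphic marker stay separate for the normalization steps that follow.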
- The A and B probe intensities are quantile normalized per sample using the approach of [Bolstad 2001, Bolstad 2003].
500k NSP and STY samples are quantile normalized separately, as are the Affy 5.0 and 6.0 polymorphic and
non-polymorphic probes. The process is as follows:
- For each of the autosomal A and B probes: sort the intensities for each sample in ascending order.
- Replace the smallest value in each sample with the mean of the smallest values, the second smallest value in each sample with the mean of the second smallest, and so on for the entire set of probes.
- Reorder the modified A and B intensities into their original order for all samples.
- Calculate modified intensities for the non-autosomal (sex) probes by finding the closest autosomal intensity value and substituting its corresponding quantile-normalized intensity.
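The quantile normalization steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the CNAM implementation: the function names are hypothetical, every sample is assumed to have the same number of autosomal probes, and ties are broken by probe order.

```python
from bisect import bisect_left


def quantile_normalize(samples):
    """Quantile normalize autosomal intensities across samples.

    samples is a list of equal-length lists, one list per sample.
    """
    n_probes = len(samples[0])
    # Probe indices of each sample, sorted by ascending intensity.
    orders = [sorted(range(n_probes), key=s.__getitem__) for s in samples]
    # Mean of the k-th smallest intensity across all samples.
    rank_means = [
        sum(s[order[k]] for s, order in zip(samples, orders)) / len(samples)
        for k in range(n_probes)
    ]
    # Write the rank means back in each sample's original probe order.
    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for k, probe in enumerate(order):
            out[probe] = rank_means[k]
        normalized.append(out)
    return normalized


def normalize_sex_probes(sex_values, autosomal_raw, autosomal_norm):
    """Give each sex-chromosome intensity the normalized value of the
    closest raw autosomal intensity from the same sample."""
    pairs = sorted(zip(autosomal_raw, autosomal_norm))
    raw_sorted = [raw for raw, _ in pairs]
    result = []
    for value in sex_values:
        j = bisect_left(raw_sorted, value)
        # The nearest neighbour is either just below or just above.
        candidates = [c for c in (j - 1, j) if 0 <= c < len(pairs)]
        best = min(candidates, key=lambda c: abs(raw_sorted[c] - value))
        result.append(pairs[best][1])
    return result


raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
norm = quantile_normalize(raw)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
sex = normalize_sex_probes([2.9, 10.0], raw[0], norm[0])  # [3.5, 5.5]
```

After normalization every sample shares the same set of autosomal intensity values; only their assignment to probes differs. The sex probes are mapped afterwards so that they do not influence the rank means.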
- Calculate a reference distribution and calculate log2 ratios:
- Select a set of samples for the reference distribution, for instance the controls. There is an option under development to only include females as a reference for the X chromosome probe intensities.
- For each polymorphic probe $i$, calculate the median of the quantile-normalized A intensities and the median of the quantile-normalized B intensities, $A_{i,\mathrm{med}}$ and $B_{i,\mathrm{med}}$, over the reference samples. Then, for a given pair of probe intensities $A_i$ and $B_i$, the normalized copy number signal is the log2 ratio of the sum of the $A_i$ and $B_i$ probe intensities to the sum of the median A and B probe intensities:

$$\log_2 \frac{A_i + B_i}{A_{i,\mathrm{med}} + B_{i,\mathrm{med}}}$$
- For non-polymorphic probes, there is no B intensity, but the analogous normalization is performed:

$$\log_2 \frac{A_i}{A_{i,\mathrm{med}}}$$
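The reference medians and log2 ratios described in this step can be sketched as follows. The function names and the list-of-lists data layout are assumptions for illustration. The polymorphic case divides the sum of the A and B intensities by the sum of the reference medians, which is one reading of the description above.

```python
import math
from statistics import median


def reference_medians(reference_samples):
    """Per-probe median over the reference samples.

    reference_samples is a list of per-sample lists of
    quantile-normalized intensities, all in the same probe order.
    """
    return [median(values) for values in zip(*reference_samples)]


def log2_ratios(a, a_med, b=None, b_med=None):
    """Log2 ratio of one sample against the reference medians.

    Pass only a and a_med for non-polymorphic probes, which have
    no B intensity.
    """
    if b is None:
        return [math.log2(ai / am) for ai, am in zip(a, a_med)]
    return [
        math.log2((ai + bi) / (am + bm))
        for ai, bi, am, bm in zip(a, b, a_med, b_med)
    ]


# Made-up reference intensities: three samples, two probes.
ref_a = [[100.0, 200.0], [120.0, 220.0], [80.0, 180.0]]
ref_b = [[50.0, 100.0], [60.0, 90.0], [40.0, 110.0]]
a_med = reference_medians(ref_a)  # [100.0, 200.0]
b_med = reference_medians(ref_b)  # [50.0, 100.0]

poly = log2_ratios([200.0, 150.0], a_med, [100.0, 150.0], b_med)
nonpoly = log2_ratios([400.0, 50.0], a_med)
```

In this example the first polymorphic probe has twice the reference signal, giving a log2 ratio of 1, and the second matches the reference, giving 0.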
- Join together distributions:
- Recall, for the 500k, the NSP and STY log2 ratios are calculated separately. Also, for arrays containing non-polymorphic copy number probes, these must be joined with the polymorphic probes. We join different arrays together by the “virtual array generation” procedure outlined in section 7 of [Affymetrix 2007].
- The Affymetrix procedure defines a log2 ratio range that determines which markers are copy-neutral. Instead, we sample from the middle third of the distribution of autosomal log2 ratios, but otherwise follow the same procedure: the copy number log2 ratios are centered about zero by subtracting the mean of the copy-neutral markers, and the distributions are scaled by their respective signal-to-noise ratios.
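The centering part of this step can be sketched as below. Several things here are assumptions: the function names, the dictionary layout keyed by marker ID, and the use of the entire middle third of the sorted autosomal log2 ratios rather than a sample drawn from it. The signal-to-noise scaling is deliberately left out, because its definition comes from [Affymetrix 2007] and is not spelled out in this section.

```python
def middle_third_mean(values):
    """Mean of the middle third of a non-empty collection of values."""
    ordered = sorted(values)
    cut = len(ordered) // 3
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def center_array(ratios, autosomal_markers):
    """Center one array's log2 ratios on its copy-neutral markers.

    ratios maps marker ID to log2 ratio; autosomal_markers lists the
    marker IDs used to estimate the copy-neutral mean.
    """
    offset = middle_third_mean([ratios[m] for m in autosomal_markers])
    return {marker: value - offset for marker, value in ratios.items()}


def join_arrays(*centered_arrays):
    """Merge centered per-array log2 ratios into one marker map."""
    joined = {}
    for array in centered_arrays:
        joined.update(array)
    return joined


# Made-up log2 ratios for two arrays (for example NSP and STY).
nsp = {"n1": -0.9, "n2": 0.1, "n3": 0.2, "n4": 0.3, "n5": 0.5, "n6": 1.5}
sty = {"s1": -0.4, "s2": -0.2, "s3": 0.6}
joined = join_arrays(
    center_array(nsp, list(nsp)),
    center_array(sty, list(sty)),
)
```

Centering each array on its own copy-neutral mean before merging removes array-specific offsets, so a log2 ratio of zero means copy-neutral on every array in the joined set.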