Created:
March 27, 2008

Updated:
July 16, 2008

User Level:
Intermediate

Products:
HelixTree, CNAM, WGA Module


Step 1. Generate Log Ratios Versus Reference Samples

Before you can perform copy number analysis, you need a DSF file containing log2 ratios (referred to below as LogRs), created by normalizing raw intensity data against a reference sample. CNAM directly supports creating LogR DSF files from Illumina and Affymetrix platforms, with additional functionality for creating them from other providers' data. This tutorial focuses on preparing a LogR DSF file from Affymetrix CEL files. To learn how to create DSF files from Affymetrix CNT or CNCHP files, Illumina data or data from other providers, see the following sections in the manual.

›› Importing Affymetrix CNT or CNCHP files
›› Importing Illumina data
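
To make the term "log2 ratio" concrete, here is a minimal Python sketch of the arithmetic only. It is not CNAM's implementation: the function name is hypothetical, the intensities are made up, and summarizing the reference samples by their per-probe median is an assumption chosen for illustration.

```python
import math
from statistics import median

def log2_ratios(sample, references):
    """Return one log2(sample / reference baseline) value per probe.

    sample     -- list of normalized intensities for one sample
    references -- list of intensity lists, one per reference sample
    The per-probe median of the references is used as the baseline
    (an assumption for this sketch, not a documented CNAM choice).
    """
    n_probes = len(sample)
    baseline = [median(ref[i] for ref in references) for i in range(n_probes)]
    return [math.log2(s / b) for s, b in zip(sample, baseline)]

# Made-up intensities for three probes.
sample = [200.0, 100.0, 50.0]
references = [
    [100.0, 100.0, 100.0],
    [100.0, 110.0, 90.0],
    [100.0, 90.0, 110.0],
]
logr = log2_ratios(sample, references)
print(logr)  # [1.0, 0.0, -1.0]
```

A LogR of 0 means the sample intensity matches the reference baseline, positive values suggest a gain and negative values suggest a loss.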

The workflow CNAM uses to generate normalized LogRs from Affymetrix 500K, SNP 5.0 and 6.0 CEL files is analogous to the methodology employed by Affymetrix's Genotyping Console 2. However, CNAM can perform quantile normalization without gender bias, scale to handle thousands of samples, and allow greater flexibility in choosing a reference set. To learn more about the methodology CNAM uses to process Affymetrix CEL files, click here.
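
For readers unfamiliar with quantile normalization, the following Python sketch shows the generic textbook form of the technique: every array is forced onto the same intensity distribution by replacing each value with the mean of the values holding the same rank across arrays. This is an illustration only. It does not reproduce CNAM's procedure or its handling of gender bias, and the function name and data are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length intensity lists (one list per array).

    Ties within an array are broken by probe order, which is a
    simplification; other tie-handling rules exist.
    """
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Target distribution: mean of the k-th smallest value across arrays.
    target = [
        sum(col[k] for col in sorted_samples) / len(samples)
        for k in range(n_probes)
    ]
    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = target[rank]
        normalized.append(out)
    return normalized

# Two made-up arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both arrays contain exactly the same set of values, and only the assignment of those values to probes differs.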

Preparing Files Needed to Process Affymetrix CEL files

Before CNAM can process CEL files, the following are needed:

  • Spreadsheet matching NSP and STY (for 500K data only)
  • Spreadsheet with CEL file names and column indicating reference status
  • Affymetrix marker maps
  • Affymetrix library files

Spreadsheet matching NSP and STY (for 500K only)
To properly process both the NSP and STY CEL files, a spreadsheet matching the two needs to be imported into HelixTree. This spreadsheet will tell the CEL file import tool how to join the NSP and STY samples together to create one sample per patient in the DSF file.

IMPORTANT: The matching spreadsheet must have a row label column and at least two data columns. The row labels must be the sample names. The first and second data columns must be the NSP file names and the STY file names that are to be joined together, respectively. Other columns are optional. If one of them holds the reference status for each sample, this spreadsheet can also serve as the reference status spreadsheet (see below).

Figure 1. ASCII File Import option with Row Label Column Number set to 1.

The easiest way to import this file into HelixTree is to create a CSV file with the appropriate columns and then select >File >Import Data >Import ASCII File from the Project Navigator.

IMPORTANT: Make sure to enter the Row Label Column Number representing your sample names.

The imported spreadsheet should resemble the image below.

Figure 2. Spreadsheet for matching files from the NSP and STY arrays for 500K analysis.
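
If you have many samples, the matching CSV file can be generated with a short script instead of by hand. The Python sketch below writes the layout described above: sample names in column 1 (to be used as the Row Label Column), then the NSP file name, then the STY file name. The column headers, sample names and file names are hypothetical, and whether the import tool expects the file names with or without the ".CEL" extension is not stated here, so compare your output against Figure 2.

```python
import csv
import io

def write_matching_csv(pairs, out):
    """Write one row per sample: sample name, NSP file name, STY file name.

    pairs -- dict mapping sample name to an (nsp_file, sty_file) tuple
    out   -- any text file object opened for writing
    """
    writer = csv.writer(out, lineterminator="\n")
    # Header names are placeholders; only the column order matters here.
    writer.writerow(["Sample", "NSP File", "STY File"])
    for sample, (nsp_file, sty_file) in sorted(pairs.items()):
        writer.writerow([sample, nsp_file, sty_file])

# Hypothetical samples and file names.
pairs = {
    "Patient02": ("P02_Nsp.CEL", "P02_Sty.CEL"),
    "Patient01": ("P01_Nsp.CEL", "P01_Sty.CEL"),
}
buffer = io.StringIO()
write_matching_csv(pairs, buffer)
matching_csv = buffer.getvalue()
print(matching_csv)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("matching.csv", "w", newline="")`.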

Spreadsheet with CEL file names and column indicating reference status
This spreadsheet needs two columns: sample names (as row labels) and reference status. For the SNP 5.0 and SNP 6.0 arrays, the row labels should be the file names of the CEL files with the “.CEL” extension removed. “0s” should denote samples to be used as references and “1s” should denote non-references.

Note: It is up to the researcher to finalize a reference strategy. In CNAM you can use any external or internal samples as your reference set. Affymetrix recommends using at least 25 samples as references in un-paired copy number analysis. As discussed later, if using an external reference set, these samples can be dropped from the corresponding DSF file.

Figure 3. Reference status spreadsheet.

As with the NSP and STY matching spreadsheet above, the easiest way to import this file into HelixTree is to create a CSV file and then select >File >Import Data >Import ASCII File from the Project Navigator window. Make sure to enter the Row Label Column Number representing your sample names. The imported spreadsheet should resemble Figure 3.
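
The reference status CSV can also be scripted. The Python sketch below builds the rows for SNP 5.0 or 6.0 data: it strips the ".CEL" extension to form the row labels and writes 0 for references and 1 for non-references, as described above. It also reports whether the reference count reaches the 25 samples Affymetrix recommends for un-paired analysis. The function names, column headers and file names are hypothetical.

```python
import csv
import io
from pathlib import PurePath

MIN_RECOMMENDED_REFERENCES = 25  # Affymetrix recommendation cited in this tutorial

def reference_rows(cel_files, reference_files):
    """Return (row label, status) pairs with 0 = reference, 1 = non-reference."""
    rows = []
    for name in sorted(cel_files):
        label = PurePath(name).stem  # file name with the .CEL extension removed
        status = 0 if name in reference_files else 1
        rows.append((label, status))
    return rows

def write_reference_csv(rows, out):
    writer = csv.writer(out, lineterminator="\n")
    # Header names are placeholders for this sketch.
    writer.writerow(["Sample", "Reference Status"])
    writer.writerows(rows)

# Hypothetical file names; one external reference sample.
cel_files = ["Tumor_01.CEL", "HapMap_NA12878.CEL", "Tumor_02.CEL"]
reference_files = {"HapMap_NA12878.CEL"}

rows = reference_rows(cel_files, reference_files)
n_references = sum(1 for _, status in rows if status == 0)
enough_references = n_references >= MIN_RECOMMENDED_REFERENCES

buffer = io.StringIO()
write_reference_csv(rows, buffer)
reference_csv = buffer.getvalue()
print(reference_csv)
print("References:", n_references, "meets recommendation:", enough_references)
```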

Affymetrix marker maps (annotation files)
You will need an Affymetrix marker map corresponding to the CEL files you wish to import. Probes not contained in the marker map will not be included in the resulting LogR DSF file. For example, if the marker map does not contain copy number probes, those probes will not exist in the DSF file for copy number analysis.

The latest Affymetrix marker maps can be downloaded using the Affymetrix NetAffx service in HelixTree. To access this feature from the Project Navigator window, select >File >Import Data >Download Affymetrix Marker Map. You will be prompted for your Affymetrix NetAffx login information, which can be freely obtained by registering on Affymetrix’s website.

https://www.affymetrix.com/analysis/netaffx/index.affx

After entering your NetAffx login information, the Download Annotations window will appear listing various Affymetrix annotation files (Figure 4).

Figure 4. NetAffx marker map download window with both SNP 5.0 marker maps selected.

IMPORTANT: There are two annotation files for each of the 500K, 5.0 and 6.0 arrays (500K = NSP + STY; SNP 5.0 and 6.0 = SNP + CN probes). For each array, both corresponding annotation files need to be downloaded at the same time for HelixTree to properly merge them. To do this, highlight both annotation files (as seen in Figure 4), make sure the Import into project when downloaded box is checked and click Download. If you downloaded the annotation files previously, they should show up as a single merged file in the lower section of the window. In that case, you can simply highlight that file and click Load Into Project.

Affymetrix library files
Similar to Affymetrix marker maps, Affymetrix library files can be downloaded using the NetAffx service in HelixTree. The library files should contain both SNPs and CN Probes when appropriate. To download library files, select the >Tools >Download Affymetrix Library File menu option from the Project Navigator window. After entering your login information, HelixTree will load a list of library files available through the NetAffx service.

To download library files, select one or more files from the upper window, and click Download. The file(s) will automatically be downloaded to the ../HelixTree/AffyLibraryFiles directory.

Converting Affymetrix CEL files to LogR DSF File

Figure 5. The Affymetrix CEL File Import Dialog in HelixTree.

From the Project Navigator window, select >CNAM >Import Affymetrix >Import CEL Files.

From the CEL file import dialog (Figure 5), first select the CEL files you want to include in the data set. For 500K data, you must select files from both the NSP and STY arrays for each sample. To select CEL files, click the Add CEL button, navigate to the appropriate folder and select the CEL files you want to process (you can hold down the Shift key to select multiple files at once). The CEL files you selected will appear in the CEL import dialog window (Figure 6). You may add all of the CEL files in a particular directory by using the Add Directory button.

Figure 6. CEL files selected for conversion into a log2 ratio DSF file.

This is especially helpful if, for example, you store your NSP and STY files in separate directories. To remove CEL files from the window, select the unwanted samples and click Remove Selected. You may continue adding CEL files by clicking the Add CEL or the Add Directory buttons again.
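
Because 500K data requires both an NSP and an STY file for every sample, it can be worth checking your file selection against the matching spreadsheet before starting the import. The Python sketch below is a stand-alone pre-check, not a HelixTree feature: the function name is hypothetical, and it assumes the file names in your matching rows are written exactly as they appear on disk.

```python
def missing_cel_files(matching_rows, selected_files):
    """Return {sample: [missing file names]} for incomplete NSP/STY pairs.

    matching_rows  -- iterable of (sample, nsp_file, sty_file) tuples
    selected_files -- file names you intend to add to the import dialog
    """
    selected = set(selected_files)
    missing = {}
    for sample, nsp_file, sty_file in matching_rows:
        absent = [f for f in (nsp_file, sty_file) if f not in selected]
        if absent:
            missing[sample] = absent
    return missing

# Hypothetical matching rows and file selection.
matching_rows = [
    ("Patient01", "P01_Nsp.CEL", "P01_Sty.CEL"),
    ("Patient02", "P02_Nsp.CEL", "P02_Sty.CEL"),
]
selected_files = ["P01_Nsp.CEL", "P01_Sty.CEL", "P02_Nsp.CEL"]

problems = missing_cel_files(matching_rows, selected_files)
print(problems)  # {'Patient02': ['P02_Sty.CEL']}
```

An empty result means every sample in the matching rows has both of its files in the selection.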

For 500K CEL file import, next check the 500K NSP/STY Matching check box and select the previously imported matching spreadsheet (Figure 7).

Figure 7. Selecting a spreadsheet to use for NSP and STY array matching.

If you are importing 5.0 or 6.0 CEL files, leave this box unchecked.

Next, select a spreadsheet containing the Reference Status for the samples (Figure 8). When a spreadsheet is selected, the 0=Ref 1=Non-Ref Column drop-down box will fill with the binary data columns in the selected spreadsheet. Select the name of the column that indicates the reference status.

Figure 8. Spreadsheet and spreadsheet column selected for determining reference status.

NOTE: The effect of gender of the reference samples should be considered for copy number analysis of the X and/or Y chromosomes.

Check the Don’t include reference sample in output Log R DSF box if you are using external reference samples (e.g. HapMap data) and do not want them included in the resulting DSF file.

Next, select the Marker Map previously imported to be used in the analysis and choose the Library Path where the CDF library files for the appropriate array can be found (Figure 9). These reside in the directory where you previously saved them.

Figure 9. Marker map spreadsheet selected and library directory where CDF (or *.gcdf) library files are located.

NOTE: After using the CEL import tool for your given array (500K, 5.0, 6.0), an AffyLibraryFiles directory will be created in the HelixTree installation directory containing *.gcdf library files. These files can be used from that point on instead of CDF files, so you don't have to download them every time.

You have the option to specify a Temp Directory (Figure 10) where intermediate DSF files will be stored. If your project is located on a shared network drive (not recommended), you should specify a Temp Directory on a local disk. Finally, select the Output LogR DSF file location and click OK to begin the import.

Figure 10. Optional temporary directory and output DSF file name selected.

The conversion will take several minutes per CEL file to complete. The DSF file created is ready to be imported and analyzed using either the Copy Number Segmentation tool or the LogR Association Tests and PCA window.

From here you can proceed to Step 2: Identify markers to exclude, Step 3: Correct for batch effects/stratification, Step 4: Perform whole genome log ratio association tests, or Step 5: Run segmenting algorithm.