4.4 Importing Copy Number Data


[Picture]
Figure 4.34: The CNAM menu allows you to import data and run the copy number optimal segmenting algorithm

The optional Copy Number Analysis Module (CNAM) scans normalized microarray log2 ratio data from both the Affymetrix and Illumina platforms to determine where copy number variation occurs, and then performs association analysis on the log2 ratio data over these regions of copy number variation. You may also prepare your own log2 ratio data for analysis by CNAM using the Affymetrix CNT file format (Appendix C.4).

The first two steps in performing copy number analysis with CNAM consist of the following:

  • Prepare a “.dsf” file of log2 ratio data from your Affymetrix or Illumina data.
  • Execute the copy number optimal segmenting algorithm on your “.dsf” data file.

Please see section 25 for a complete overview and discussion of copy number analysis.

NOTE: These submenu items will only be available if you are licensed to use copy number analysis.

4.4.1 Import Affymetrix Data

For the Affymetrix 500K, SNP 5.0, and SNP 6.0 arrays, CNAM supports reading CEL intensity files and calculating normalized log2 ratios for copy number segmentation in CNAM and association analysis in HelixTree. For details about reading CEL intensity files, see section 4.4.1.1.

For the Affymetrix 10K, 100K, and 500K arrays, you may use the Affymetrix CNAT Batch Analysis tool to create CNT files; or for the 100K, 500K, and SNP 6.0 arrays use Genotyping Console to create CNCHP files. These files contain normalized log2 ratios and can be converted into a DSF file for analysis in CNAM. See Creating CNT Files using the Affymetrix CNAT Batch Analysis Tool and Creating CNCHP Files using Affymetrix Genotyping Console (Appendix C) for instructions on creating these files. Once your files have been parsed to extract the log2 ratio values, the Copy Number Import tool can be run on the resulting DSF file to create a segment means spreadsheet for analysis. For instructions on converting your CNT and CNCHP files to log2 ratio DSF files, see sections 4.4.1.2 and 4.4.1.3.

You may also create a minimal data file with your own normalized log2 ratio data using the Affymetrix CNT file format. This file must contain log2 ratio data and marker map data. See Affymetrix CNT File Format (Appendix C.4) for details. You can then use the Affy CNT Conversion tool to generate a DSF of the data.

4.4.1.1 Import CEL Files

The Affy CEL import tool reads CEL intensity files, normalizes the intensity values against the given reference samples, and imports the normalized log2 ratios to a DSF file ready for copy number analysis in CNAM. The methodology for calculating log2 ratios from the CEL files is described in section 25.11.

The CEL import tool can read CEL files from the following arrays:

  • Affymetrix GeneChip® Human Mapping 500K Array
  • Affymetrix GeneChip® Genome Wide SNP 5.0 Array
  • Affymetrix GeneChip® Genome Wide SNP 6.0 Array

NOTE: See section 25.11 for details on how CNAM reads and normalizes the intensity values from CEL files into log2 ratios.
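
As a rough illustration of what a log2 ratio against reference samples means, the following Python sketch divides each probe intensity in one sample by the mean intensity of that probe across the reference samples and takes the base-2 logarithm. This is a simplified assumption for illustration only and is not the CNAM normalization procedure, which is described in section 25.11. The function name and the intensity values are made up.

```python
import math

def log2_ratios(sample, references):
    """Return per-probe log2(sample intensity / mean reference intensity)."""
    ratios = []
    for i, value in enumerate(sample):
        # Mean intensity of probe i across all reference samples.
        ref_mean = sum(ref[i] for ref in references) / len(references)
        ratios.append(math.log2(value / ref_mean))
    return ratios

# Made-up intensities for three probes in two reference samples.
references = [[100.0, 200.0, 400.0], [100.0, 200.0, 400.0]]
sample = [100.0, 400.0, 200.0]

print(log2_ratios(sample, references))  # [0.0, 1.0, -1.0]
```

In this toy data, a value of 0 means the probe intensity matches the references, a positive value suggests a gain, and a negative value suggests a loss.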


[Picture]
Figure 4.35: The Affymetrix CEL File Import Dialog in CNAM

To open the Affymetrix CEL file import dialog, click the CNAM->Import Affymetrix->Import CEL Files menu item.

From the CEL import dialog, first select the CEL files you want to include in the data set. For Mapping 500K data, you must select files from both the Nsp and Sty arrays for each sample. To select CEL files, click the Add CEL button and use the file browser to select multiple CEL files. The CEL files you selected will appear in the CEL import dialog window. You may add all of the CEL files in a directory by using the Add Directory button. To remove CEL files from the window, select the unwanted samples and click Remove Selected. You may continue adding CEL files by clicking the Add CEL or the Add Directory buttons again.


[Picture]
Figure 4.36: CEL files selected for conversion into a copy number DSF file

For the import of Mapping 500K CEL files, a matching spreadsheet containing the file names must be available in HelixTree. This spreadsheet will tell the CEL import tool how to join the Nsp and Sty samples together to create one sample per patient in the DSF file. The matching spreadsheet should have a row label column and at least two data columns. The row labels should be the sample names. The first and second columns should be the Nsp file names and the Sty file names that should be joined together. Other columns in the data set are optional but may contain the reference status for the sample.
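
If you have many samples, you may prefer to generate the matching table with a script and then import it into HelixTree as a spreadsheet. The Python sketch below pairs Nsp and Sty files by sample name and writes the rows as comma-separated text. It assumes a hypothetical naming scheme of the form <sample>_Nsp.CEL and <sample>_Sty.CEL; the column headers are also made up, and whether the file name entries should keep the ".CEL" extension should be checked against your own project.

```python
import csv
import io

def build_matching_rows(cel_names):
    """Pair Nsp and Sty CEL files by sample name (hypothetical naming scheme)."""
    pairs = {}
    for name in cel_names:
        stem = name.rsplit(".", 1)[0]               # drop the extension
        sample, _, enzyme = stem.rpartition("_")    # "P01_Nsp" -> "P01", "Nsp"
        pairs.setdefault(sample, {})[enzyme.lower()] = name
    rows = []
    for sample in sorted(pairs):
        files = pairs[sample]
        if "nsp" not in files or "sty" not in files:
            raise ValueError(f"{sample}: need both an Nsp and a Sty file")
        # Row label first, then the Nsp file name, then the Sty file name.
        rows.append([sample, files["nsp"], files["sty"]])
    return rows

rows = build_matching_rows(
    ["P01_Nsp.CEL", "P01_Sty.CEL", "P02_Sty.CEL", "P02_Nsp.CEL"]
)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Sample", "Nsp File", "Sty File"])  # hypothetical headers
writer.writerows(rows)
print(buffer.getvalue())
```

The sketch raises an error for any sample that is missing one of the two arrays, which mirrors the requirement that both the Nsp and Sty files be selected for each sample.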


[Picture]
Figure 4.37: Spreadsheet for matching files from the Nsp and Sty arrays for 500K analysis. Spreadsheet also contains reference status using case/control data.

For Mapping 500K CEL file import, check the 500K NSP/STY Matching check box and select the matching spreadsheet to be used.


[Picture]
Figure 4.38: Selecting a spreadsheet to use for Nsp and Sty array matching

Select a spreadsheet containing the Reference Status for the samples. The row labels should match the sample names. For the SNP 5.0 and SNP 6.0 arrays, the row labels should be the file names of the CEL files with the “.CEL” extension removed. 0’s should denote samples to be used as references (controls) and 1’s should denote non-references (cases). The reference status can be any case/control variable on which you want to base the copy number variation analysis. When a spreadsheet is selected, the 0=Ref 1=Non-Ref Column drop-down box is filled with the names of the binary data columns in the selected spreadsheet. Select the name of the column to be used as the reference status.

Note: Affymetrix recommends using at least 25 samples as references in un-paired copy number analysis.

Note: The gender of the reference samples should be considered for copy number analysis of the X and/or Y chromosomes.
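
The reference status table for SNP 5.0 or SNP 6.0 data can be drafted outside HelixTree before it is imported as a spreadsheet. The Python sketch below derives row labels by removing the ".CEL" extension from each file name, assigns 0 to references and 1 to non-references, and checks the count of references against the recommendation of 25. The function names, file names, and the set of reference labels are hypothetical.

```python
from pathlib import PurePath

MIN_REFERENCES = 25  # Affymetrix recommendation for un-paired analysis

def reference_status_rows(cel_files, reference_labels):
    """Return (row label, status) pairs: 0 = reference, 1 = non-reference."""
    rows = []
    for cel in cel_files:
        label = PurePath(cel).name
        if label.lower().endswith(".cel"):
            label = label[:-4]  # drop the ".CEL" extension
        rows.append((label, 0 if label in reference_labels else 1))
    return rows

def has_enough_references(rows, minimum=MIN_REFERENCES):
    """True when at least `minimum` rows are marked as references (0)."""
    return sum(1 for _, status in rows if status == 0) >= minimum

rows = reference_status_rows(
    ["cels/NA0001.CEL", "cels/NA0002.CEL", "cels/TUMOR01.CEL"],
    reference_labels={"NA0001", "NA0002"},
)
for label, status in rows:
    print(label, status)
if not has_enough_references(rows):
    print("Fewer than", MIN_REFERENCES, "reference samples")
```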


[Picture]
Figure 4.39: Spreadsheet and spreadsheet column selected for determining reference status

You also have the option to omit samples with the reference designation from the final LogR DSF. To do this, check the check box labeled Don’t include reference samples in output LogR DSF. If this option is selected, reference samples will be used to normalize the data and calculate LogR values, but will not be included in the output DSF.


[Picture]
Figure 4.40: Check box used to drop reference samples from LogR DSF

Next, select a Marker Map to be used in the analysis. Probes that are not contained in the marker map will not be included in the resulting copy number DSF file. For example, if the marker map does not contain copy number probes, those probes will not be in the DSF file for copy number analysis. Also select the Library Path where the CDF library files for the appropriate array can be found. The library files should contain both SNPs and CN Probes.
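
The effect of the marker map on the output can be pictured as a simple filter. The Python sketch below is an illustration only, not the import tool's code; the probe names and values are made up.

```python
def filter_to_marker_map(probe_values, marker_map):
    """Keep only the probes that are listed in the marker map."""
    return {probe: value for probe, value in probe_values.items()
            if probe in marker_map}

# Made-up log2 ratios for two SNP probes and one copy number probe.
probe_values = {"SNP_0001": 0.12, "SNP_0002": -0.40, "CN_0001": 0.85}

# A marker map that lists SNP probes only.
marker_map = {"SNP_0001", "SNP_0002"}

print(filter_to_marker_map(probe_values, marker_map))
```

Because the marker map in this example lists no copy number probes, the CN_0001 entry is dropped from the result.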

Note: After using the CEL import tool once for a given array, a directory, AffyLibraryFiles, will be created in the HelixTree installation directory containing *.gcdf library files. These files can be used from that point on instead of CDF files.


[Picture]
Figure 4.41: Marker map spreadsheet selected and library directory where CDF (or *.gcdf) library files are located

You may optionally select a Temp Directory where intermediate DSF files will be stored. If your project is located on a shared network drive, you should specify a Temp Directory on a local disk. Finally, select the Output Log2 DSF2 location and click OK to begin the import.


[Picture]
Figure 4.42: Optional temporary directory and output DSF file name selected

The CEL conversion will take several minutes to complete. The DSF file that is created is ready to be imported and analyzed using the Copy Number import tool (see section 25.3).

4.4.1.2 Import CNT Files


[Picture]
Figure 4.43: The Affy CNT import tool

The Affy CNT import tool converts multiple CNT files into one aggregate DSF file that contains the Log2 ratio values in a format ready to be used by the Copy Number Import tool. CNT files can be created for the Mapping 10K, 100K, and 500K arrays or for any copy number data that can be converted into a text file. See Appendices C.2 and C.4 for information on creating Affy CNT files.

To open the Affymetrix CNT file import dialog, click the CNAM->Import Affymetrix->Import CNT Files menu item.

From the dialog, you can click Add to select CNT files to convert. This will open a file chooser where you can select one or more CNT files. The CNT files you selected will appear in the CNT file convert window. To remove CNT files from the window, select the unwanted files and click Remove. You may continue adding CNT files by clicking the Add button again. Files cannot be added more than once, but files with the same name stored in different locations may be added to the same import. NOTE: Row labels in the output DSF will be determined by file name, so files with the same name stored in different locations will have the same row labels.
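
Because row labels come from file names, it can help to check for repeated file names before starting the import. The Python sketch below groups a list of paths by file name and reports any name that occurs in more than one location. It assumes the full file name is what matters; the paths and the function name are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath

def duplicate_row_labels(paths):
    """Map each repeated file name to the paths that share it."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePath(path).name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}

paths = ["batch1/s1.cnt", "batch2/s1.cnt", "batch1/s2.cnt"]
for name, found in duplicate_row_labels(paths).items():
    print(name, "appears in:", ", ".join(found))
```

Renaming one of the files reported here before the import keeps the row labels in the output DSF distinct.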

You must also select a location in which to save the output DSF file. Click Browse to open a file chooser and choose a save location.

Once you have selected a group of input CNT files and an output DSF file, click OK to start the import. When the import is complete, the output DSF may be imported using the Copy Number import tool (see section 25.3).

4.4.1.3 Import CNCHP Files


[Picture]
Figure 4.44: The import tool for Affymetrix CNCHP files

The Affy CNCHP import tool converts multiple CNCHP files into one aggregate DSF file that contains the Log2 ratio values in a format ready to be used by the Copy Number Import tool. CNCHP files can be created for the Mapping 100K, 500K, and SNP 6.0 arrays. See Appendix C.3 for information on creating Affy CNCHP files.

To open the Affymetrix CNCHP file import dialog, click the CNAM->Import Affymetrix->Import CNCHP Files menu item.

From the dialog, you can click Add CNCHP Files to select CNCHP files to convert. The CNCHP files you selected will appear in the CNCHP file convert window. You may add all of the CNCHP files in a directory by using the Add Directory button. To remove CNCHP files from the window, select the unwanted files and click Remove Selected Files. You may continue adding CNCHP files by clicking the Add CNCHP Files or the Add Directory buttons again. Files cannot be added more than once, but files with the same name stored in different locations may be added to the same import. NOTE: Row labels in the output DSF will be determined by file name, so files with the same name stored in different locations will have the same row labels.

You must also select a location in which to save the output DSF file. Click Browse... to open a file chooser and choose a save location.


[Picture]
Figure 4.45: The CNCHP file parsing tool with selected files

Once you have selected a group of input CNCHP files and an output DSF file, click Import to start the import. When the import is complete, the output DSF may be imported using the Copy Number import tool (see section 25.3).

4.4.2 Import Illumina Data

For the Illumina platform, you must use BeadStudio with the HelixTree DSF Plug-In to export the LogR values from your BeadStudio project. For instructions on how to install and use the plug-in, see Using the HelixTree DSF Export Plug-In in Illumina BeadStudio 3.0 (Appendix D.3). With the DSF Plug-In, you can choose to export the entire project or specific chromosomes.

4.4.3 Appending DSF Files


[Picture]
Figure 4.46: The Append DSF Dialog

Using the append DSF dialog, it is possible to merge multiple DSF files into one larger DSF file which will contain all of the data from the original files. This can be useful if you have data where samples are split between multiple files but the columns are the same across all files. NOTE: For the files to be appended successfully, the original DSF files must all contain the same columns in the same order.
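
The column requirement can be stated precisely as "every file has exactly the same list of column names, in the same order, as the first file". The Python sketch below expresses that rule for column lists you have obtained by some other means; it does not read DSF files, and the file names and column names are made up.

```python
def mismatched_files(columns_by_file):
    """Return the files whose column list differs from the first file's."""
    items = list(columns_by_file.items())
    _, expected = items[0]
    # List comparison is order-sensitive, which matches the requirement.
    return [name for name, columns in items[1:] if columns != expected]

columns_by_file = {
    "batch1.dsf": ["marker1", "marker2", "marker3"],
    "batch2.dsf": ["marker1", "marker2", "marker3"],
    "batch3.dsf": ["marker2", "marker1", "marker3"],  # same columns, wrong order
}
print(mismatched_files(columns_by_file))  # ['batch3.dsf']
```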

To open the append DSF dialog, click the CNAM->Append DSF files menu item.

You can now add DSF files to be appended. To do this, click Add, and select the files which you would like to use. Selected files will be displayed in the dialog’s list. Files may be added in multiple batches, allowing files from multiple directories to be appended; however, files may not appear in the list more than once. To remove a file from the list, highlight the file and click Remove.

You will also need to enter a save location for the output DSF. To do this, click Browse, and select a save location.

When the desired input DSF files and the final save location have been selected, click OK to start the appending.

4.4.4 Directly Import LogR Data


[Picture]
Figure 4.47: The Import LogR DSF dialog.

The Import LogR DSF dialog allows you to directly import log2 ratio data into a dataset (spreadsheet). You can select only the chromosomes that you would like to import, and it is usually a good idea to limit this to one or two chromosomes, as log2 ratio data requires a lot of memory to open in a spreadsheet.
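
To get a feel for why limiting the chromosomes helps, the Python sketch below makes a back-of-the-envelope estimate of the size of a table of log2 ratios. It assumes 8 bytes per value, which may not match how HelixTree stores spreadsheet data, and the sample and marker counts are made-up round numbers.

```python
BYTES_PER_VALUE = 8  # assumption: one double-precision value per cell

def estimated_bytes(n_samples, n_markers, bytes_per_value=BYTES_PER_VALUE):
    """Rough size of a samples-by-markers table of log2 ratios."""
    return n_samples * n_markers * bytes_per_value

# Hypothetical project: 500 samples, 1.8 million markers genome-wide,
# of which 150,000 fall on the chromosome of interest.
whole_genome = estimated_bytes(500, 1_800_000)
one_chromosome = estimated_bytes(500, 150_000)

print(f"whole genome:   {whole_genome / 2**30:.1f} GiB")
print(f"one chromosome: {one_chromosome / 2**30:.1f} GiB")
```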

4.4.5 Run Copy Number Segmentation


[Picture]
Figure 4.48: The Copy Number Analysis window.

The Copy Number Analysis Segmentation dialog represents the second step for doing copy number analysis with the Copy Number Analysis Module (CNAM), once you have a DSF file of log2 ratio data. It scans this file and uses a segmenting algorithm to discover regions of SNPs in your data that vary significantly from each other.

It then creates a dataset (spreadsheet) in your currently open HelixTree project that lists every segment computed, its start and end positions, and the segment mean for every sample. It can optionally create a bookmark file of segment locations for import into another utility such as a genome browser.

To open the Copy Number Segmentation dialog, click the CNAM->Run Copy Number Segmentation menu item.

Please see section 25.3 for a description of how to use this tool and section 25 for a complete overview and discussion of copy number analysis.