Creating and Using Genome Assemblies in SVS Tutorial

Welcome to the Creating Genome Assemblies in SVS Tutorial!


Updated: January 28, 2021

Level: Advanced

Packages: All Packages of SVS

Several genome assemblies are currently available within SVS, including the human, cattle, and soybean genomes. To see the assemblies that are currently available for use in SVS, go to Tools > Manage Genome Assemblies and select Download from Golden Helix.

But what if you are studying Zebrafish and find that no genome assembly is available for the most recent build (Zv9) of the species Danio rerio (Zebrafish)? If you have the necessary information, or are willing to locate it yourself, creating your own genome assembly in SVS is straightforward. Keep in mind that locating the information can be difficult, and there is no hard and fast rule for doing so. Let's first work through the process using the Zebrafish research scenario, then look at how you can create your own 'fake' genome assembly for any grouping variable.

Requirements

To complete this tutorial you will need to download the following:

SVS_Genome_Assembly_Tutorial.zip

We hope you enjoy the experience and look forward to your feedback.

1. Create a Genome Assembly for Danio rerio

To create a genome assembly for any species you need a listing of the assembled chromosomes and/or scaffolds along with their corresponding lengths for the particular genome build.

NOTE: If you will also be creating a Reference Sequence for your species, the assembly file can be created automatically in SVS by the Convert Source Wizard, which uses the FASTA reference file to determine the length of each defined segment. You can skip to Part 2 of this tutorial for an example.

If you were to google ‘Zebrafish genome’ you may find the Web page http://uswest.ensembl.org/Danio_rerio/Info/Index. If you click on the More information and statistics link under the Genome assembly: Zv9 header and then click the GenBank Assembly ID GCA_000002035.2 link, you will be directed to the NCBI website for this genome assembly. Under the Assembly Statistics tab you will see a listing of the assembled chromosomes with the corresponding lengths for this species.

Figure 1-1: Danio rerio genome
  • Open a new project in SVS or open a project you already have that contains genomic information for Zebrafish. From the project navigator, choose Tools > Manage Genome Assemblies or type Ctrl + G. Click User Genome Assemblies Folder. If you have not created any assemblies the folder should be empty.
  • Right-click in the empty folder and select New > Text Document to create an empty text file and name it appropriately with species name and genome build. For this example choose the name Danio_rerio_Zv9.assembly. Open the text file that you just created.
  • The header lines of the file are a summary of the genome build information for this species along with relevant build dates and should look as follows:
{
  "coordinates" : "Zv9,Chromosome,Danio rerio",
  "build" : "Zv9",

  "common" : [ "Zebrafish", "bony fishes" ],
  "taxID" : "7955",
  "genBankID" : "GCA_000002035.2",
  "refSeqID" : "GCF_000002035.4",
  "date" : "2010-07-15",

  "modified" : "2013-12-30",
  • The value of the build attribute is a way for SVS to refer to your genome assembly within the program. The name should be unique as it is generally only used for disambiguation when multiple genome assemblies apply to the same coordinate system. The value of the coordinates attribute is used to identify the coordinate system your genome assembly refers to. The value is formatted according to the Distributed Annotation System (DAS) specification in three parts separated by commas: authority,type,species. The authority in this case matches the genome build, Zv9. For best results you should find the appropriate entry in the DAS Registry coordinate system list.

NOTE: If the assembly you are working with is not represented in the list, something reasonable can be invented. Keep in mind that if the value of the coordinates attribute is user-invented, it may not match annotations and other data available elsewhere.

  • The optional entries for common, taxID, genBankID, and refSeqID may be provided to help SVS categorize your genome assembly. Multiple common entries may be provided, one for each common name of the species. The taxID is the unique NCBI Taxonomy ID assigned to the species. For the zebrafish assembly you can find it by clicking the Taxonomy link under the Related Information header on the right side of the NCBI assembly page, which directs you to the Taxonomy Browser website.
  • The date entry provides the release date of the represented assembly, and the modified entry records the date the file was last edited (the current date when you create it). These values must be formatted YYYY-MM-DD.
  • The remainder of the assembly file contains segment information–this will be either chromosome or scaffold information depending on the species. For the case of the Zebrafish genome there will be a row listed for each chromosome along with its corresponding length in base pairs, as shown earlier from the main NCBI webpage for this genome build (Figure 1-1). Add the chromosome information with the correct formatting including all commas, quotation marks and brackets as follows:
{
  "coordinates" : "Zv9,Chromosome,Danio rerio",
  "build" : "Zv9",

  "common" : [ "Zebrafish", "bony fishes" ],
  "taxID" : "7955",
  "genBankID" : "GCA_000002035.2",
  "refSeqID" : "GCF_000002035.4",
  "date" : "2010-07-15",

  "modified" : "2013-12-30",

  "segment" :
  [
     { "name" : [ "1" ], "length" : 60348388, "type" : "autosome" },
     { "name" : [ "2" ], "length" : 60300536, "type" : "autosome" },
     { "name" : [ "3" ], "length" : 63268876, "type" : "autosome" },
     { "name" : [ "4" ], "length" : 62094675, "type" : "autosome" },
     { "name" : [ "5" ], "length" : 75682077, "type" : "autosome" },
     { "name" : [ "6" ], "length" : 59938731, "type" : "autosome" },
     { "name" : [ "7" ], "length" : 77276063, "type" : "autosome" },
     { "name" : [ "8" ], "length" : 56184765, "type" : "autosome" },
     { "name" : [ "9" ], "length" : 58232459, "type" : "autosome" },
     { "name" : [ "10" ], "length" : 46591166, "type" : "autosome" },
     { "name" : [ "11" ], "length" : 46661319, "type" : "autosome" },
     { "name" : [ "12" ], "length" : 50697278, "type" : "autosome" },
     { "name" : [ "13" ], "length" : 54093808, "type" : "autosome" },
     { "name" : [ "14" ], "length" : 53733891, "type" : "autosome" },
     { "name" : [ "15" ], "length" : 47442429, "type" : "autosome" },
     { "name" : [ "16" ], "length" : 58780683, "type" : "autosome" },
     { "name" : [ "17" ], "length" : 53984731, "type" : "autosome" },
     { "name" : [ "18" ], "length" : 49877488, "type" : "autosome" },
     { "name" : [ "19" ], "length" : 50254551, "type" : "autosome" },
     { "name" : [ "20" ], "length" : 55952140, "type" : "autosome" },
     { "name" : [ "21" ], "length" : 44544065, "type" : "autosome" },
     { "name" : [ "22" ], "length" : 42261000, "type" : "autosome" },
     { "name" : [ "23" ], "length" : 46386876, "type" : "autosome" },
     { "name" : [ "24" ], "length" : 43947580, "type" : "autosome" },
     { "name" : [ "25" ], "length" : 38499472, "type" : "autosome" },
     { "name" : [ "Un" ], "length" : 55413200, "type" : "autosome", "visible" : "data" },
     { "name" : [ "MT" ], "length" : 16596, "type" : "mitochondrial", "visible" : "data" }
  ]
}
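Because a single missing comma or bracket makes the file unreadable, it can help to check the file before restarting SVS. The following is a minimal Python sketch, not part of SVS, that assumes the assembly file parses as standard JSON (the listing above does). The function name check_assembly, the choice of which attributes to treat as required, and the trimmed two-segment sample are assumptions made for illustration.

```python
import json
import re
from datetime import datetime

# Values named in this tutorial for the "type" and "visible" entries.
ALLOWED_TYPES = {"autosome", "allosome", "mitochondrial"}
ALLOWED_VISIBLE = {"always", "never", "data"}


def check_assembly(text):
    """Return a list of problems found in the assembly text (empty if none)."""
    try:
        doc = json.loads(text)  # catches missing commas, quotes, or brackets
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]

    problems = []
    # Assumption: these three attributes are treated as required here.
    for key in ("coordinates", "build", "segment"):
        if key not in doc:
            problems.append(f"missing attribute: {key}")
    if "coordinates" in doc and len(doc["coordinates"].split(",")) != 3:
        problems.append("coordinates should be authority,type,species")
    for key in ("date", "modified"):
        value = doc.get(key)
        if value is None:
            continue
        try:
            if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
                raise ValueError(value)
            datetime.strptime(value, "%Y-%m-%d")  # rejects e.g. month 13
        except ValueError:
            problems.append(f"{key} is not a valid YYYY-MM-DD date")
    for seg in doc.get("segment", []):
        label = (seg.get("name") or ["?"])[0]
        if not isinstance(seg.get("length"), int) or seg["length"] <= 0:
            problems.append(f"segment {label}: length must be a positive integer")
        if seg.get("type") not in ALLOWED_TYPES:
            problems.append(f"segment {label}: unknown type {seg.get('type')!r}")
        if seg.get("visible", "always") not in ALLOWED_VISIBLE:
            problems.append(f"segment {label}: unknown visible value")
    return problems


# Trimmed sample: only two of the segments from the listing above.
SAMPLE = """
{
  "coordinates" : "Zv9,Chromosome,Danio rerio",
  "build" : "Zv9",
  "date" : "2010-07-15",
  "modified" : "2013-12-30",
  "segment" :
  [
     { "name" : [ "1" ], "length" : 60348388, "type" : "autosome" },
     { "name" : [ "MT" ], "length" : 16596, "type" : "mitochondrial", "visible" : "data" }
  ]
}
"""

if __name__ == "__main__":
    print(check_assembly(SAMPLE) or "no problems found")
```

To check a real file, read its contents (for example with open("Danio_rerio_Zv9.assembly").read()) and pass the text to check_assembly.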

If the chromosome names for your species and build are not the standard 1, 2, 3, etc. then you will want to include alias names along with the standard names for that chromosome. For example if your species labeled chromosome 1 as ch01 then you can include it as an alias as follows:

{ "name" : [ "1", "ch01" ], "length" : 60348388, "type" : "autosome" },

NOTE: For computing index and coverage information for BAM files loaded into a GenomeBrowse window, SVS must be able to identify the corresponding reference sequence to be used in the computation. SVS uses matching between the BAM header information and the Genome Assembly files that are saved locally to your machine for this purpose. The chromosome names and lengths must match exactly between the two for the correct reference sequence to be identified.
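To see why a BAM file might fail to match, you can compare its header against your assembly yourself. The sketch below is an illustration only and does not reproduce SVS's internal matching. It parses the @SQ lines of a SAM-format header (the text printed by samtools view -H) and reports sequences whose name or length differs from the assembly segments. The function names, the sample header, and the scaffold_x label are hypothetical, and counting alias names as matches is an assumption.

```python
def parse_sq_lines(header_text):
    """Map sequence name -> length from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type is TAG:VALUE, e.g. SN:1 or LN:60348388.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def header_mismatches(sq_lengths, segments):
    """List header sequences whose name or length has no exact match."""
    known = {}
    for seg in segments:
        for name in seg["name"]:  # standard name plus any aliases (assumption)
            known[name] = seg["length"]
    problems = []
    for name, length in sq_lengths.items():
        if name not in known:
            problems.append(f"{name}: not in assembly")
        elif known[name] != length:
            problems.append(f"{name}: length {length} != {known[name]}")
    return problems


# Hypothetical header text; MT is deliberately one base too long.
HEADER = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:60348388\n"
    "@SQ\tSN:MT\tLN:16597\n"
    "@SQ\tSN:scaffold_x\tLN:1000\n"
)
SEGMENTS = [
    {"name": ["1", "ch01"], "length": 60348388},
    {"name": ["MT"], "length": 16596},
]

if __name__ == "__main__":
    for problem in header_mismatches(parse_sq_lines(HEADER), SEGMENTS):
        print(problem)
```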

  • Following each length entry should be a chromosome type designation. Options for these entries include autosome, allosome, and mitochondrial.
  • The last entry for each segment, visible, is optional and controls whether the segment is shown in a full genome view. Options for this entry include always, never, and data (show the segment only if there is data listed in that location). If nothing is listed, the default of always showing that region in a genome-wide zoom is assumed.

Save the file, then close and reopen SVS. Now you should be able to use this assembly within a GenomeBrowse window in SVS by selecting Danio rerio (Zebrafish), Zv9 (Jul 2010) from the genome build dropdown menu.

2. Building Annotation Sources

We will now use the SVS Convert Source Wizard.

  • Open SVS and go to Tools > Manage Data Sources to open the Data Source Library.
_images/dSL.png
Figure 2-1: Data Source Library
  • Click the Convert… button on the bottom left of the dialog to open the wizard.
_images/convertSource.png
Figure 2-2: Convert Source Wizard

NOTE: Full documentation on this new tool can be found in the SVS Manual or by selecting the Help button on the dialog.

A. Creating a Reference Sequence

An allele reference sequence source can be built for any species where there is an available DNA sequence (FASTA) file.

Download the available FASTA file for the Zv9 assembly from the Ensembl FTP site.

  • Step 1: Click the Add button on the Define Input page of the Convert dialog, navigate to the downloaded FASTA file and select the *.fa.gz file. Then click Next >.
  • Step 2: The converter will scan the file to come up with a list of the chromosomes (or scaffolds) that are included in the FASTA and determine the length of each segment. It will also attempt to match the information found to an existing assembly file.
  • Step 3: If a genome assembly match was found the next Change Options screen will show it in the Genome Assembly (Build): drop-down box. For this data we have already created the assembly file but the chromosome names in the FASTA file do not yet match.
  • We will need to rename the segments using the option at the bottom of the dialog before it will correctly match to the Danio rerio Zv9 assembly.
  • To rename, select RegExp from the drop-down and type (.*) dna(.*) in the first box and \1 in the second. It should look like Figure 2-3.
_images/segmentMatch.png
Figure 2-3: Assembly match by renaming segments
  • If you scroll down the segment list you will start to see some additional segments that were not included in the assembly file (unmapped scaffolds). In this case we do not want to include them in the reference sequence, so right-click on the Use column header, select Uncheck Unmapped, and then click Next >.
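The effect of the renaming pattern can be tried outside SVS. The sketch below applies the same expression, (.*) dna(.*) with replacement \1, using Python's re module. The sample header lines follow the usual Ensembl FASTA layout but are written here for illustration, and whether SVS requires the pattern to match the whole name (as this sketch does) is an assumption.

```python
import re


def rename_segment(header, pattern=r"(.*) dna(.*)", replacement=r"\1"):
    """Apply the rename pattern; names that do not match are left unchanged."""
    match = re.fullmatch(pattern, header)
    return match.expand(replacement) if match else header


# Illustrative segment names as they might appear in the FASTA file.
HEADERS = [
    "1 dna:chromosome chromosome:Zv9:1:1:60348388:1",
    "MT dna:chromosome chromosome:Zv9:MT:1:16596:1",
]

if __name__ == "__main__":
    for header in HEADERS:
        print(f"{header!r} -> {rename_segment(header)!r}")
```

The first capture group keeps everything before " dna", so each name is reduced to the bare chromosome label used in the assembly file.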

NOTE: SVS has an upper limit of 5000 segments that can be included. The wizard will scan all the available segments in the FASTA file but only allow the longest 5000 to be selected for inclusion in the reference sequence source.

NOTE: If no match is determined for an existing assembly file, you can have the wizard create a new assembly based on the segments and lengths determined from the FASTA data. You will just need to select <Create New> from the genome build drop-down and fill in the required build information.
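The scan described in the notes above can be approximated in a few lines. This is a rough sketch and not the wizard's actual implementation. It counts the bases under each FASTA header, then keeps the longest segments up to a limit and formats them as segment entries. Taking the first whitespace-delimited token of each header as the segment name is an assumption, as are the function names and the tiny sample data.

```python
import io


def fasta_lengths(handle):
    """Return {segment name: length in bases} for a FASTA text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first token of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def longest_segments(lengths, limit=5000):
    """Keep the `limit` longest segments, formatted as assembly entries."""
    ranked = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
    return [{"name": [name], "length": length} for name, length in ranked[:limit]]


# Tiny made-up FASTA text; scaffold_x is a hypothetical unmapped scaffold.
DEMO = io.StringIO(
    ">1 dna:chromosome\nACGT\nACGT\n"
    ">MT dna:chromosome\nACGT\n"
    ">scaffold_x dna:supercontig\nAC\n"
)

if __name__ == "__main__":
    for entry in longest_segments(fasta_lengths(DEMO), limit=2):
        print(entry)
```

For a compressed download, open the file with gzip.open(path, "rt") from the standard library and pass the resulting handle to fasta_lengths.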

  • The next window is for labeling the data source and documenting the conversion process. At minimum, you will want to enter an informative Name: for the source, then click Next >.

NOTE: For data sources curated by Golden Helix, we will fully document the source of the data, including any citations that are required by the provider. See Figure 2-4 for an example.

_images/document.png
Figure 2-4
  • Step 4: For the last window you can select a location to save the created source. By default, your SVS User Annotation Folder will be selected.
  • Click Convert to create the reference sequence.

B. Creating a Gene Annotation

A gene annotation track can be built for any species where there is an available gene annotation file. Supported file formats are Delimited Text, GTF, or GFF.

Download the available GTF file for the Zv9 assembly from the Ensembl FTP site.

  • Step 1: Click the Add button on the Define Input page of the Convert dialog, navigate to the downloaded GTF file and select the *.gtf.gz file. Then click Next >.
  • Step 2: The converter will scan the file to come up with a list of the chromosomes (or scaffolds) that can be used to match the information found to an existing assembly file.
  • Step 3: The first screen will be a listing of the fields found in the file along with their type. You can select which fields to include in the track and change the type if necessary. For this set we will leave the default options (Figure 2-5) and then click Next >.
_images/outputFields.png
Figure 2-5: Plot Type and Output Options Window
  • If a genome assembly match was found the next Change Options screen will show it in the Genome Assembly (Build): drop-down box. For this dataset it should match to the correct Zebrafish assembly we have built. There are still a number of unmapped scaffolds we will not include in the track.
  • Right-click on the Use column and select Uncheck Unmapped and click Next >.
  • On the next screen, fill in any documentation for the track (Figure 2-6) and click Next >.
_images/document2.png
Figure 2-6: Gene Track Documentation
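If you want to know in advance which sequences in a GTF file will show up as unmapped, a short script can list them. The sketch below is an illustration and does not reflect how the converter works internally. It relies only on the GTF convention that the first tab-separated column is the sequence name and that comment lines start with #. The function names, sample rows, and scaffold_x label are hypothetical.

```python
def gtf_seqnames(lines):
    """Return the distinct sequence names in a GTF, in order of appearance."""
    names = []
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        seqname = line.split("\t", 1)[0]
        if seqname not in seen:
            seen.add(seqname)
            names.append(seqname)
    return names


def split_mapped(seqnames, segments):
    """Separate names found in the assembly segments from unmapped ones."""
    known = {name for seg in segments for name in seg["name"]}
    mapped = [n for n in seqnames if n in known]
    unmapped = [n for n in seqnames if n not in known]
    return mapped, unmapped


# Made-up GTF rows for illustration.
GTF_LINES = [
    "#!genome-build Zv9\n",
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'scaffold_x\tprotein_coding\texon\t10\t50\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n',
]
SEGMENTS = [{"name": ["1"], "length": 60348388}]

if __name__ == "__main__":
    mapped, unmapped = split_mapped(gtf_seqnames(GTF_LINES), SEGMENTS)
    print("mapped:", mapped)
    print("unmapped:", unmapped)
```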

  • Step 4: For the last window select a location to save the created source. An additional feature that is available with gene annotation sources is the ability to index certain fields. The indexing makes searching for those values in the GenomeBrowse plot window much faster. In this case leave the default Gene Name and Transcript Name fields to be indexed (Figure 2-7) and click Convert.

_images/indexFields.png
Figure 2-7: Index Field Options

C. Visualizing the Annotation Sources

Now that the tracks have been created, they can be used in SVS for analysis or just for visualization.

  • Open a new GenomeBrowse window by going to Tools > New GenomeBrowse Window.
  • Select the Danio rerio (Zebrafish), Zv9 (Jul 2010) assembly from the genome assembly drop-down menu, then click Add.
  • Select both of the created sources, Ensembl Genes 74, Ensembl and Reference Sequence Zv9, Ensembl, and then click Plot & Close.
  • You can zoom into different features, or type in any Zebrafish gene name to jump to that location. For example type GCNT7 in the location bar to automatically zoom into this region (Figure 2-8).
_images/geneView.png
Figure 2-8: GCNT7 Gene View
  • If you hover your mouse over Exon 1 of the gene and scroll up, you can zoom in and see the amino acids encoded by the exon in the gene annotation source, as well as the nucleotides that make up the reference sequence at that location.

3. Create a ‘Fake’ Genome Assembly for any Grouping Variable

Let’s say we have a phenotypic dataset with 500 samples and columns for the subject’s age, the state in which they live, and their weight. Of course we know that weight typically increases with age and also some states have a higher prevalence of obesity. Because of this we may want to plot the weight variable in the genome browser and separate the data by state, similar to separating by chromosome in a genotypic dataset.

To continue, you will need the dataset downloaded at the beginning of this tutorial. From an open project go to Import > Text. Browse to the download location and select the obesity.csv file. Then click Open, leave the default settings and click OK.

  • Now, open the obesity Dataset – Sheet 1.
  • Choose File > Create Marker Map from Spreadsheet.
  • For Select marker name column: choose Row Labels (sampleID), for Select chromosome column: choose state, and for the Select position column: choose age. Enter US State Map as the New Marker Map Name.
  • Your window should match Figure 3-1. Click Next>.
  • In the Create Marker Map Parameters – Step Two window, leave weight: checked and click Create to create the marker map.
Figure 3-1: Create marker map from spreadsheet window

Now you need to apply the map. From the obesity Dataset – Sheet 1 spreadsheet choose File > Apply Genetic Marker Map and choose the one we just created. Make sure that you select Row labels under Marker Names Are at the bottom of the window and click OK (see Figure 3-2). Another dialog window will pop up allowing us to enable default marker map fields. For now, leave all three as checked and click OK. A new mapped sheet will be created.

Figure 3-2: Select A Genetic Marker Map Dialog Window

Next, close the spreadsheets, and from the Project Navigator choose Tools > Manage Genome Assemblies, or press Ctrl-G. In the Manage Genome Assemblies window, select From Marker Mapped Spreadsheet… Choose the obesity Dataset – Mapped Sheet 1 and click OK. For the Name:, type US State ‘Genome’ and for the Build:, type States in the first field and US States in the second (see Figure 3-3). Click OK and a new genome assembly will be created. Close the Manage Genome Assemblies window.

Figure 3-3: Specify Genome Assembly Build Information
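Conceptually, the 'fake' assembly treats each value of the grouping variable as a segment and the position column as a coordinate within it. The sketch below illustrates that idea only. How SVS actually derives segment lengths from the mapped spreadsheet is not documented here, so using the largest position seen in each group as the length is an assumption, as are the function name and the sample rows (the column names mirror the ones selected in the marker map dialog).

```python
import csv
import io


def segments_from_rows(rows, group_col, pos_col):
    """Build one segment per group, sized by the largest position seen."""
    longest = {}
    for row in rows:
        group = row[group_col]
        position = int(row[pos_col])
        longest[group] = max(longest.get(group, 0), position)
    return [{"name": [group], "length": longest[group]} for group in sorted(longest)]


# Made-up rows standing in for the tutorial dataset.
DEMO_CSV = """sampleID,age,state,weight
s1,34,Utah,150
s2,61,Utah,180
s3,45,Montana,170
"""

if __name__ == "__main__":
    rows = csv.DictReader(io.StringIO(DEMO_CSV))
    for segment in segments_from_rows(rows, group_col="state", pos_col="age"):
        print(segment)
```

Here each state plays the role of a chromosome and each subject's age plays the role of a base-pair position, which is what lets GenomeBrowse lay the weight values out by state.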

Now, let us see how this was useful…

  • Open the obesity Dataset – Mapped Sheet 1.
  • Right-click on the weight column header and choose Plot Variable in GenomeBrowse.
  • At the top of the GenomeBrowse window, you should see a drop-down menu that currently says Homo sapiens (Human), GRCh37hg19 (Feb 2009).
  • Click the arrow on the right side of the box and find the genome assembly that we just created: States, US States. Now your data should be visible in the plot viewer.
  • To separate by ‘chromosome’ or in this case State, click on the weight node in the Plot Tree and on the Display tab of the Controls window select Chromosome under the Style By: drop-down.
Figure 3-4: Plot of weight grouped by state

The resulting plot should look like Figure 3-4. If you also want to be able to see the values for each data point you can open a Feature List for the plot by right-clicking anywhere on the graph and selecting Feature List.

Figure 3-5: Plot with Feature List
Updated on March 22, 2021
