The description of the practical is based on the IGV tutorial of Alex H. Wagner

1 Introduction

1.1 Description of the lab

Welcome to the lab for Genome Visualization! This lab will introduce you to the Integrative Genomics Viewer, one of the most popular visualization tools for High Throughput Sequencing (HTS) data.

After this lab, you will be able to:

  • Visualize a variety of genomic data
  • Quickly navigate around the genome
  • Visualize read alignments
  • Validate SNP/SNV calls and structural re-arrangements by eye

Things to know before you start:

  • The lab may take between 1 and 2 hours, depending on your familiarity with genome browsing. Do not worry if you do not complete the lab. It will remain available to review later.
  • There are a few thought-provoking Questions or Notes pertaining to sections of the lab. These are optional and may take more time, but they are meant to help you better understand the visualizations you are seeing. These questions will be denoted by boxes, as follows:

Question(s):

Thought-provoking question goes here

1.2 Compatibility

This tutorial was intended for IGV v2.8.0, which is available on the IGV Download page. It is strongly recommended that you use this version.

1.3 Data Set for IGV

We will be using publicly available Illumina sequence data from the HCC1143 cell line. The HCC1143 cell line was generated from a 52-year-old Caucasian woman with breast cancer. Additional information is available for both the HCC1143 line (tumor, TNM stage IIA, grade 3, primary ductal carcinoma) and the HCC1143/BL line (matched normal, EBV-transformed lymphoblast cell line).

Create a new directory in your home directory to hold the files of this practical (e.g. 02_OMICS_IGV). This will be your working directory.

You will need two input files: the bam alignment file and its bai index file.

Copy the input bam file from /shared/02_OMICS to your working directory. It contains sequence read alignments from the HCC1143 cell line for the region Chromosome 21: 19,000,000-20,000,000.

Download the bai index file from the web to your working directory using the wget command.
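The setup steps above could also be scripted. The following is a minimal sketch in Python, not part of the original tutorial. The bam file name is taken from the track name shown later in this lab; the function names are hypothetical, and the URL of the bai file is not given in this handout, so it has to be passed in by you.

```python
import shutil
import urllib.request
from pathlib import Path

# File name as it appears in the IGV track list later in this lab.
BAM_NAME = "HCC1143.normal.21.19M-20M.bam"


def prepare_workdir(shared_dir, workdir, bam_name=BAM_NAME):
    """Create the working directory and copy the bam file into it."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    target = workdir / bam_name
    shutil.copy2(Path(shared_dir) / bam_name, target)
    return target


def download_index(index_url, workdir):
    """Download the bai index into the working directory (what wget would do).

    index_url is an assumption: supply the address given by your instructor.
    """
    target = Path(workdir) / index_url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(index_url, target)
    return target
```

A typical call would be `prepare_workdir("/shared/02_OMICS", Path.home() / "02_OMICS_IGV")`, followed by `download_index(...)` with the real address of the index file.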

2 Visualization Part 1: Getting familiar with IGV

2.1 Get familiar with the interface

2.1.1 Load a Genome and some Data Tracks

By default, IGV loads the Human hg19 genome. If you work with another version of the human genome, or with another organism altogether, you can change the genome using the drop-down menu in the upper-left corner. For this lab, we will be using Human hg19.

We will also load additional tracks from the server (File -> Load from Server...):

  • Ensembl Genes: Available Datasets -> Annotations -> Genes -> Ensembl Genes
  • GC Percentage: Available Datasets -> Annotations -> Sequence and Regulation -> GC Percentage
  • dbSNP 1.4.7: Available Datasets -> Annotations -> Variation and Repeats -> dbSNP 1.4.7

Load additional data tracks

2.2 Region Lists

Sometimes, it is really useful to save where you are, or to load regions of interest. For this purpose, there is a Region Navigator in IGV. To access it, click Regions -> Region Navigator. While you browse around the genome, you can save some bookmarks by pressing the Add button at any time.

Bookmarks in IGV

Regions of interest are marked with a red line on the navigation bar below the chromosome.
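Instead of adding bookmarks one by one, a list of regions can be prepared in advance as a BED file. The sketch below converts the loci visited in Part 2 of this lab into BED lines. It assumes that your IGV version can import such a file (Regions -> Import Regions...); the region names and function names are made up for illustration. Note that BED start coordinates are 0-based, whereas the loci typed into IGV are 1-based, hence the subtraction of 1.

```python
import re

# Matches loci such as "chr21:19,479,237-19,479,814".
LOCUS_RE = re.compile(r"^(\w+):([\d,]+)-([\d,]+)$")

# Loci from Part 2 of this lab; the names are hypothetical labels.
LAB_REGIONS = {
    "neighbouring_SNPs": "chr21:19,479,237-19,479,814",
    "homopolymer_indel": "chr21:19,518,412-19,518,497",
    "coverage_by_GC": "chr21:19,611,925-19,631,555",
    "low_mapping_quality": "chr21:19,800,320-19,818,162",
    "homozygous_deletion": "chr21:19,324,469-19,331,468",
    "translocation": "chr21:19,089,694-19,095,362",
}


def locus_to_bed(locus, name=""):
    """Convert a 1-based IGV locus string into a tab-separated BED line."""
    match = LOCUS_RE.match(locus.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end locus: {locus!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or start > end:
        raise ValueError(f"invalid coordinates in {locus!r}")
    fields = [chrom, str(start - 1), str(end)]  # BED start is 0-based
    if name:
        fields.append(name)
    return "\t".join(fields)


def write_bed(path, regions=LAB_REGIONS):
    """Write one BED line per named region."""
    with open(path, "w") as handle:
        for name, locus in regions.items():
            handle.write(locus_to_bed(locus, name) + "\n")
```

Calling `write_bed("lab_regions.bed")` would produce a six-line file in the current directory.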

2.3 Loading Read Alignments

We will be using the breast cancer cell line HCC1143 to visualize alignments. For speed, only a small portion of chr21 will be loaded (positions 19,000,000-20,000,000).

HCC1143 Alignments to hg19:

Load the read alignment file from your working directory to IGV: File -> Load from File..., select the bam file, and click OK. Note that the bam and bai index files must be in the same directory for IGV to load these properly.
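If IGV refuses to load the alignments, the index is the first thing to check. The following sketch looks for an index file next to the bam file. It assumes the two common naming conventions (`<name>.bam.bai` and `<name>.bai`); the function name is hypothetical.

```python
from pathlib import Path


def find_index(bam_path):
    """Return the path of the bai index next to the bam file, or None."""
    bam = Path(bam_path)
    candidates = [
        bam.with_name(bam.name + ".bai"),  # e.g. sample.bam.bai
        bam.with_suffix(".bai"),           # e.g. sample.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

A return value of `None` means that no index was found in the same directory as the bam file.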

Load bam track from File

2.4 Visualizing read alignments

Navigate to a narrow window on chromosome 21: chr21:19,480,041-19,480,386.

To start our exploration, right click on the track name and select the following options:

  • Sort alignments by start location
  • Group alignments by pair orientation

Experiment with the various settings by right clicking the read alignment track (left bar: HCC1143.normal.21.19M-20M.bam) and toggling the options.

Changing how read alignments are sorted, grouped, and colored

You will see reads represented by grey or white bars stacked on top of each other, where they were aligned to the reference genome. The reads are pointed to indicate their orientation (i.e. the strand on which they are mapped). Mouse over any read and notice that a lot of information is available. To toggle read display from hover to click, select the yellow box and change the setting.

Changing how read information is shown (i.e. on hover, click, never)

Once you select a read, you will see many metrics for it. Over the course of this lab you will learn what many of them mean and how to use them to assess the quality of your datasets. At each base where the read sequence mismatches the reference, the colour of the base represents the letter that exists in the read (using the same colour legend used for displaying the reference).

Viewing read information for a single aligned read

3 Visualization Part 2: Inspecting SNPs, SNVs, and SVs

In this section we will be looking in detail at 8 positions in the genome, and determining whether they represent real events or artifacts.

3.1 Two neighbouring SNPs

  • Navigate to region chr21:19,479,237-19,479,814
  • Note the two heterozygous variants: one corresponds to a known dbSNP entry (rs982274: G/T, on the right), the other does not (C/T, on the left)
  • Zoom in and center on the C/T SNV on the left (the SNV is at position chr21:19,479,321)
  • Sort alignments by base
  • Color alignments by read strand

Example 1. Good quality SNVs/SNPs

Notes:

  • High base qualities in all reads except one (where the alt allele is the last base of the read)
  • Good mapping quality of reads, no strand bias, allele frequency consistent with heterozygous mutation
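The two criteria in the last note, allele frequency and strand bias, can be put into numbers once you have counted reads by eye or read them from the coverage track. The sketch below is only an illustration: the function name is hypothetical and any counts you pass in are your own, not values from this dataset. For a heterozygous variant the alternative allele fraction is expected to be near 0.5, and without strand bias the alternative allele should appear on both strands in similar proportions.

```python
def summarize_site(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Summarize read counts at one position.

    Arguments are counts of reads supporting the reference (ref) or the
    alternative (alt) allele on the forward (fwd) or reverse (rev) strand.
    """
    alt_total = alt_fwd + alt_rev
    depth = ref_fwd + ref_rev + alt_total
    if depth == 0 or alt_total == 0:
        raise ValueError("need at least one alt-supporting read")
    return {
        "depth": depth,
        # Fraction of all reads that carry the alternative allele.
        "alt_fraction": alt_total / depth,
        # Share of alt-supporting reads on the forward strand.
        "alt_forward_share": alt_fwd / alt_total,
    }
```

For example, `summarize_site(10, 10, 10, 10)` describes a balanced site: half of the reads carry the alternative allele and they are split evenly between strands.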

Question:

1 What does "Shade base by quality" do? How might this be helpful?

3.2 Homopolymer region with indel

The most abundant simple sequence repeat tracts are the homopolymer repeats poly(dA).poly(dT) and poly(dG).poly(dC). Long (> 9 bp) tracts of both types are found at higher than expected frequencies in the non-coding regions of eukaryote genomes. Homopolymer tracts, for example, can serve as protein binding signals, particularly as upstream promoter elements.

Navigate to position chr21:19,518,412-19,518,497

Example 2a

  • Group alignments by read strand
  • Center on the A within the homopolymer region (chr21:19,518,470), and Sort alignments by -> base

Example 2a

Question:

2 How would you better show that the poor base qualities are mostly found on the forward strand?

Example 2b

  • Center on the one base deletion (chr21:19,518,452)
  • Group alignments by -> none
  • Sort alignments by -> base

Example 2b

Notes:

  • The region contains misalignments caused by repeats
  • The alternative allele is either a deletion or insertion of one or two Ts
  • The remaining bases are mismatched, because the alignment is now out of sync

3.3 Coverage by GC

Navigate to position chr21:19,611,925-19,631,555. Note that coverage drops to zero in a few places within this range.

Example 3

  • Use Collapsed view
  • Use Color alignments by -> insert size and pair orientation
  • Take a look at the GC track
  • See concordance of coverage with GC content

Example 3

Note:

GC content influences the success of PCR amplification and can therefore influence coverage.
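To make clear what a GC percentage track represents, here is a minimal sketch that computes the GC fraction of a sequence in consecutive windows. It is an illustration only: the function names and the window size are assumptions, and it does not reproduce how the IGV server track was actually generated.

```python
def gc_fraction(seq):
    """Fraction of G and C among the called bases (A, C, G, T) of seq."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return None  # e.g. a window made only of N
    return sum(base in "GC" for base in called) / len(called)


def windowed_gc(seq, window):
    """GC fraction for consecutive, non-overlapping windows of seq."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]
```

Windows with extreme values (very low or very high GC) are the places where you would look for a matching dip in coverage.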

Question:

3 Why are there blue and red reads throughout the alignments?

3.4 Low mapping quality

Navigate to region chr21:19,800,320-19,818,162

  • Load the repeat track (File -> Load from Server...)

Load repeats

Example 4

Notes:

  • Mapping quality plunges in all reads (white instead of grey). Once we load repeat elements, we see that there are two LINE elements that cause this. LINEs are repetitive elements in the genome.
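Mapping quality (MAPQ) is a Phred-scaled value: under the SAM convention, MAPQ = -10 log10(P), where P is the estimated probability that the read is placed at the wrong position. A read that fits equally well in several copies of a repeat therefore receives a MAPQ at or near 0. The sketch below converts a MAPQ value back into that probability; the function name is hypothetical.

```python
def mapq_error_probability(mapq):
    """Probability that the mapping position is wrong, given a MAPQ value."""
    if mapq == 255:
        return None  # in SAM, 255 means the mapping quality is unavailable
    if mapq < 0:
        raise ValueError("MAPQ cannot be negative")
    return 10 ** (-mapq / 10)
```

With this convention, a MAPQ of 0 corresponds to a probability of 1 (the placement is essentially arbitrary), while a MAPQ of 20 corresponds to a probability of about 0.01.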

3.5 Homozygous deletion

Navigate to region chr21:19,324,469-19,331,468

Example 5

  • Turn on View as Pairs and Expanded view
  • Use Color alignments by -> insert size and pair orientation
  • Sort alignments by -> insert size
  • Click on a red read pair to pull up information on alignments

Example 5

Questions:

4 What is the insert size of a red coloured read pair?
5 How long is the corresponding deletion if the insert size of the other read pairs is 350 bp?
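The reasoning behind question 5 can be sketched as simple arithmetic. When a read pair spans a deletion, the two reads map further apart on the reference than they were in the sequenced fragment, so the apparent insert size is roughly the true insert size plus the deleted length. The function name is hypothetical, and this is a rough estimate because real insert sizes vary from fragment to fragment.

```python
def estimate_deletion_length(observed_insert_size, expected_insert_size):
    """Rough deletion length implied by an oversized insert.

    observed_insert_size: insert size reported for a pair spanning the deletion
    expected_insert_size: typical insert size of normal pairs in the library
    """
    if observed_insert_size < expected_insert_size:
        raise ValueError("observed insert is not larger than expected")
    return observed_insert_size - expected_insert_size
```

To use it, pass the insert size you read from the IGV popup as the first argument and the typical insert size of the library as the second.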

3.6 Translocation

Navigate to region chr21:19,089,694-19,095,362

Example 6

  • Expanded view
  • Group alignments by -> pair orientation
  • Color alignments by -> insert size and pair orientation

Example 6

Notes:

  • Many reads with mismatches to reference
  • Read pairs in right-left pattern (instead of left-right pattern)
  • Region is flanked by reads with poor mapping quality (white instead of grey)
  • Presence of reads with pairs on other chromosomes (coloured reads at the bottom when scrolling down)
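The left-right and right-left patterns in the notes above can be expressed as a small rule. The sketch below is an interpretation, not IGV's own code: it assumes both reads of a pair map to the same chromosome, calls a forward-strand read "L" and a reverse-strand read "R", and reports the two letters in order of position. The function name is hypothetical.

```python
def pair_orientation(pos1, strand1, pos2, strand2):
    """Classify a read pair on one chromosome as LR, RL, LL or RR.

    pos1, pos2: alignment start positions of the two reads
    strand1, strand2: "+" for forward strand, "-" for reverse strand
    """
    letters = {"+": "L", "-": "R"}
    # Order the two reads by their position on the reference.
    first, second = sorted([(pos1, strand1), (pos2, strand2)])
    return letters[first[1]] + letters[second[1]]
```

With this rule a normal pair gives "LR" (the reads point towards each other), whereas "RL" means that the reads point away from each other, as in this region; "LL" and "RR" would mean that both reads are on the same strand.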

4 Save the current session