Mapping and counting RNA-seq reads with Rsubread R Bioconductor package

Toy example

Sample: small sample of reads (56,168 read-pairs) from a single replicate Sequencing: strand specific, paired-end, Illumina, 100-100nts Reference: small chromosome fragment (1-839990 nt part of the 3L chromosome of Drosophila simulans contains 96 genes)

Mapping

Call Rsubread library

library(Rsubread)

Indexing reference genome

buildindex(basename="Genome_sequence_ToyExample_index", 
           reference="Genome_sequence_ToyExample.fa")

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.20.6
## 
## //=========================== indexBuilder setting ===========================\\
## ||                                                                            ||
## ||                Index name : Genome_sequence_ToyExample_index               ||
## ||               Index space : base-space                                     ||
## ||                    Memory : 8000 Mbytes                                    ||
## ||          Repeat threshold : 100 repeats                                    ||
## ||  Distance to next subread : 3                                              ||
## ||                                                                            ||
## ||               Input files : 1 file in total                                ||
## ||                             o Genome_sequence_ToyExample.fa                ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Check the integrity of provided reference sequences ...                    ||
## || No format issues were found                                                ||
## || Scan uninformative subreads in reference sequences ...                     ||
## ||    8%,   0 mins elapsed, rate=358.3k bps/s, total=0m                       ||
## ||   16%,   0 mins elapsed, rate=499.0k bps/s, total=0m                       ||
## ||   24%,   0 mins elapsed, rate=593.7k bps/s, total=0m                       ||
## ||   33%,   0 mins elapsed, rate=667.0k bps/s, total=0m                       ||
## ||   41%,   0 mins elapsed, rate=729.3k bps/s, total=0m                       ||
## ||   49%,   0 mins elapsed, rate=786.5k bps/s, total=0m                       ||
## ||   58%,   0 mins elapsed, rate=844.6k bps/s, total=0m                       ||
## ||   66%,   0 mins elapsed, rate=894.2k bps/s, total=0m                       ||
## ||   74%,   0 mins elapsed, rate=941.3k bps/s, total=0m                       ||
## ||   83%,   0 mins elapsed, rate=985.4k bps/s, total=0m                       ||
## ||   91%,   0 mins elapsed, rate=1032.9k bps/s, total=0m                      ||
## ||   99%,   0 mins elapsed, rate=1073.4k bps/s, total=0m                      ||
## || 1 uninformative subreads were found.                                       ||
## || These subreads were excluded from index building.                          ||
## || Build the index...                                                         ||
## ||    8%,   0 mins elapsed, rate=399.2k bps/s, total=0m                       ||
## ||   16%,   0 mins elapsed, rate=429.9k bps/s, total=0m                       ||
## ||   24%,   0 mins elapsed, rate=445.0k bps/s, total=0m                       ||
## ||   33%,   0 mins elapsed, rate=458.9k bps/s, total=0m                       ||
## ||   41%,   0 mins elapsed, rate=472.0k bps/s, total=0m                       ||
## ||   49%,   0 mins elapsed, rate=485.2k bps/s, total=0m                       ||
## ||   58%,   0 mins elapsed, rate=498.5k bps/s, total=0m                       ||
## ||   66%,   0 mins elapsed, rate=511.8k bps/s, total=0m                       ||
## ||   74%,   0 mins elapsed, rate=524.4k bps/s, total=0m                       ||
## ||   83%,   0 mins elapsed, rate=537.4k bps/s, total=0m                       ||
## ||   91%,   0 mins elapsed, rate=550.5k bps/s, total=0m                       ||
## ||   99%,   0 mins elapsed, rate=563.4k bps/s, total=0m                       ||
## || Save current index block...                                                ||
## ||  [ 0.0% finished ]                                                         ||
## ||  [ 10.0% finished ]                                                        ||
## ||  [ 20.0% finished ]                                                        ||
## ||  [ 30.0% finished ]                                                        ||
## ||  [ 40.0% finished ]                                                        ||
## ||  [ 50.0% finished ]                                                        ||
## ||  [ 60.0% finished ]                                                        ||
## ||  [ 70.0% finished ]                                                        ||
## ||  [ 80.0% finished ]                                                        ||
## ||  [ 90.0% finished ]                                                        ||
## ||  [ 100.0% finished ]                                                       ||
## ||                                                                            ||
## ||                      Total running time: 0.0 minutes.                      ||
## ||       Index Genome_sequence_ToyExample_index was successfully built!       ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

An index needs to be built before read mapping can be performed. This function creates a hash table for the reference genome, which can then be used by Subread and Subjunc aligners for read alignment. Highly repetitive subreads (or uninformative subreads) are excluded from the hash table so as to reduce mapping ambiguity. (page 7 at Rsubread.pdf)

Created files: *_index.00.b.array *_index.00.b.tab *_index.files *_index.log *_index.reads

Mapping reads

subjunc(index="Genome_sequence_ToyExample_index", 
        readfile1="Trimmed_reads_ToyExample_1.fq", readfile2="Trimmed_reads_ToyExample_2.fq", 
        input_format="FASTQ", output_format="BAM", output_file="Mapped_reads_ToyExample.BAM", 
        nthreads=1, phredOffset=64, unique=TRUE, minFragLength=50, maxFragLength=10000, 
        PE_orientation="fr")

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.20.6
## 
## //============================= subjunc setting ==============================\\
## ||                                                                            ||
## || Function      : Read alignment + Junction detection (RNA-Seq)              ||
## || Input file 1  : Trimmed_reads_ToyExample_1.fq                              ||
## || Input file 2  : Trimmed_reads_ToyExample_2.fq                              ||
## || Output file   : Mapped_reads_ToyExample.BAM (BAM)                          ||
## || Index name    : Genome_sequence_ToyExample_index                           ||
## ||                                                                            ||
## ||                       Threads : 1                                          ||
## ||                  Phred offset : 64                                         ||
## ||       # of extracted subreads : 14                                         ||
## ||                Min read1 vote : 1                                          ||
## ||                Min read2 vote : 1                                          ||
## ||             Max fragment size : 10000                                      ||
## ||             Min fragment size : 50                                         ||
## ||    Maximum allowed mismatches : 3                                          ||
## ||   Maximum allowed indel bases : 5                                          ||
## || # of best alignments reported : 1                                          ||
## ||                Unique mapping : yes                                        ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //====================== Running (03-May-2018 17:36:59) ======================\\
## ||                                                                            ||
## || The input file contains base space reads.                                  ||
## || The range of Phred scores observed in the data is [5,41]                   ||
## || Load the 1-th index block...                                               ||
## ||   11% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   18% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   25% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   31% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   38% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   45% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   51% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   58% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   65% completed, 0.1 mins elapsed, rate=12.0k fragments per second         ||
## ||   69% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   73% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   76% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   79% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   83% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   86% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   89% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   93% completed, 0.1 mins elapsed, rate=5.8k fragments per second          ||
## ||   96% completed, 0.2 mins elapsed, rate=5.8k fragments per second          ||
## ||                                                                            ||
## ||                          Completed successfully.                           ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Summary ==================================\\
## ||                                                                            ||
## ||          Processed : 56,168 fragments                                      ||
## ||             Mapped : 54,470 fragments (97.0%)                              ||
## ||   Correctly paired : 51,659 fragments                                      ||
## ||          Junctions : 334                                                   ||
## ||             Indels : 327                                                   ||
## ||                                                                            ||
## ||       Running time : 0.2 minutes                                           ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

Subjunc perform global alignments. The seed-and-vote paradigm enables efficient and accurate alignments to be carried out. phredOffse: sanger 33; illumina 64 PE_orientation: character string giving the orientation of the two reads from the same pair. It has three possible values including fr, ff and rf. Letter f denotes the forward strand and letter r the reverse strand. fr by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand). (page 2 at Rsubread.pdf)

OPTIONAL - View the mapped reads with Integrative Genomics Viewer (IGV)

For IGV the mapped reads needs to be sorted - by coordinates - and indexed.

Indexing BAM file with Rsamtools

Sorting the BAM file by coordinates.

library(Rsamtools, quietly=T)

sortBam(file="Mapped_reads_ToyExample.BAM", destination="Sorted_Mapped_reads_ToyExample")

## [1] "Sorted_Mapped_reads_ToyExample.bam"

Creating an index file for the sorted BAM file.

indexBam("Sorted_Mapped_reads_ToyExample.bam")

##       Sorted_Mapped_reads_ToyExample.bam 
## "Sorted_Mapped_reads_ToyExample.bam.bai"

IGV

FASTA (genome sequence) Genomes -> Load Genome from File: Genome_sequence_ToyExample.fa

GTF (genome annotation) Files -> Load from File: Genome_annotation_ToyExample.gtf

Sorted, indexed BAM (mapped read-pairs) Files -> Load from File: Sorted_Mapped_reads_ToyExample.bam

Counting

Creating an index file for the sorted BAM file.

Counts_ToyExample=featureCounts(files="Mapped_reads_ToyExample.BAM", 
                                annot.ext="Genome_annotation_ToyExample.gtf", 
                                isGTFAnnotationFile=T, GTF.featureType="exon", 
                                GTF.attrType="gene_id", useMetaFeatures=T, 
                                isPairedEnd=T, requireBothEndsMapped=T, 
                                checkFragLength=F, nthreads=1, strandSpecific=2, 
                                reportReads=T)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.20.6
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 1 BAM file                                       ||
## ||                           P Mapped_reads_ToyExample.BAM                    ||
## ||                                                                            ||
## ||             Output file : ./.Rsubread_featureCounts_pid25920               ||
## ||                 Summary : ./.Rsubread_featureCounts_pid25920.summary       ||
## ||              Annotation : Genome_annotation_ToyExample.gtf (GTF)           ||
## ||      Assignment details : <input_file>.featureCounts                       ||
## ||                                                                            ||
## ||                 Threads : 1                                                ||
## ||                   Level : meta-feature level                               ||
## ||              Paired-end : yes                                              ||
## ||         Strand specific : reversely stranded                               ||
## ||      Multimapping reads : not counted                                      ||
## || Multi-overlapping reads : not counted                                      ||
## ||                                                                            ||
## ||          Chimeric reads : counted                                          ||
## ||        Both ends mapped : required                                         ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file Genome_annotation_ToyExample.gtf ...                  ||
## ||    Features : 1446                                                         ||
## ||    Meta-features : 96                                                      ||
## ||    Chromosomes/contigs : 1                                                 ||
## ||                                                                            ||
## || Process BAM file Mapped_reads_ToyExample.BAM...                            ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Total fragments : 56168                                                 ||
## ||    Successfully assigned fragments : 47888 (85.3%)                         ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## ||                         Read assignment finished.                          ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

This function assigns mapped sequencing reads to genomic features.

GTF.featureType: a character string giving the feature type used to select rows in the GTF annotation which will be used for read summarization. exon by default.

GTF.attrType: a character string giving the attribute type in the GTF annotation which will be used to group features (eg. exons) into meta-features (eg. genes). gene_id by default.

useMetaFeatures: logical indicating whether the read summarization should be performed at the eature level (eg. exons) or meta-feature level (eg genes). If TRUE, features in the annotation (each row is a feature) will be grouped into meta-features using their values using the “gene_id” attribute in the GTF-format annotation file, and reads will assiged to the meta-features instead of the features.

requireBothEndsMapped: logical indicating if both ends from the same fragment are required to be successfully aligned before the fragment can be assigned to a feature or metafeature.

checkFragLength: logical indicating if the two ends from the same fragment are required to satisify the fragment length criteria before the fragment can be assigned to a feature or meta-feature. The fragment length criteria are specified via minFragLength and maxFragLength.

strandSpecific integer indicating if strand-specific read counting should be performed. It has three possible values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded).

reportReads: logical indicating if read counting result for each read/fragment is saved to a file. If TRUE, read counting results for reads/fragments will be saved to a tab-delimited file that contains four columns including name of read/fragment, status (assigned or the reason if not assigned), name of target feature/meta-feature and number of hits if the read/fragment is counted multiple times. Name of the file is the same as name of the input read file except a suffix .featureCounts is added. Multiple files will be generated if there is more than one input read file. (page 12 at Rsubread.pdf)

`featureCounts` variable

Description: page 16 at Rsubread.pdf

Names of objects of the featureCounts variable.

names(Counts_ToyExample)

## [1] "counts"     "annotation" "targets"    "stat"

Counts of the first genes.

head(Counts_ToyExample$counts)

##             Mapped_reads_ToyExample.BAM
## FBgn0264707                          60
## FBgn0001316                        1150
## FBgn0035104                          11
## FBgn0035132                          27
## FBgn0262679                          40
## FBgn0035112                         234

Some data about the counts.

Counts_ToyExample$stat

##                       Status Mapped_reads_ToyExample.BAM
## 1                   Assigned                       47888
## 2       Unassigned_Ambiguity                         114
## 3    Unassigned_MultiMapping                           0
## 4      Unassigned_NoFeatures                        3736
## 5        Unassigned_Unmapped                        4430
## 6  Unassigned_MappingQuality                           0
## 7  Unassigned_FragmentLength                           0
## 8         Unassigned_Chimera                           0
## 9       Unassigned_Secondary                           0
## 10    Unassigned_Nonjunction                           0
## 11      Unassigned_Duplicate                           0

stat: a data frame giving numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library.

Summary statistics of the counted reads.

summary(Counts_ToyExample$counts)

##  Mapped_reads_ToyExample.BAM
##  Min.   :   0.00            
##  1st Qu.:  25.75            
##  Median :  97.50            
##  Mean   : 498.83            
##  3rd Qu.: 486.50            
##  Max.   :5665.00

Histogram of the counts.

hist(Counts_ToyExample$counts)

hist(log2(Counts_ToyExample$counts))

Write the count table to a file.

write.table(Counts_ToyExample$counts, "Counts_ToyExample.tsv", quote=F, sep="\t")

RNA-seq I. - Mapping contigs

David Fazekas

03/05/2018

Prepare

Install R packages from CRAN

Bioconductor

Mapping and counting RNA-seq reads with Rsubread R Bioconductor package

Toy example

Mapping

Indexing reference genome

Mapping reads

OPTIONAL - View the mapped reads with Integrative Genomics Viewer (IGV)

IGV

Counting

`featureCounts` variable

RNA-seq I. - Mapping contigs

David Fazekas

03/05/2018

Prepare

Install R packages from CRAN

Bioconductor

Mapping and counting RNA-seq reads with Rsubread R Bioconductor package

Toy example

Mapping

Indexing reference genome

Mapping reads

OPTIONAL - View the mapped reads with Integrative Genomics Viewer (IGV)

IGV

Counting

featureCounts variable

`featureCounts` variable